ABSTRACT



It is well known that the clustering of galaxies depends on galaxy type. Such relative bias 
complicates the inference of cosmological parameters from galaxy redshift surveys, and is 
a challenge to theories of galaxy formation and evolution. In this paper we perform a joint 
counts-in-cells analysis on galaxies in the 2dF Galaxy Redshift Survey, classified by both 
colour and spectral type, $\eta$, as early or late type galaxies. We fit three different models of
relative bias to the joint probability distribution of the cell counts, assuming Poisson sampling
of the galaxy density field. We investigate the nonlinearity and stochasticity of the relative
bias, with cubical cells of side 10 Mpc $\le L \le$ 45 Mpc ($h = 0.7$). Exact linear bias is ruled
out with high significance on all scales. Power law bias gives a better fit, but likelihood ratios
prefer a bivariate lognormal distribution, with a non-zero 'stochasticity', i.e. scatter that may
result from physical effects on galaxy formation other than those from the local density field.
Using this model, we measure a correlation coefficient in log-density space ($r_{\rm LN}$) of 0.958
for cells of length $L = 10$ Mpc, increasing to 0.970 by $L = 45$ Mpc. This corresponds to a
stochasticity $\sigma_b/\hat{b}$ of 0.44 ± 0.02 and 0.27 ± 0.05 respectively. For smaller cells, the Poisson
sampled lognormal distribution presents an increasingly poor fit to the data, especially with 
regard to the fraction of completely empty cells. We compare these trends with the predictions 
of semianalytic galaxy formation models: these match the data well in terms of overall level 
of stochasticity, variation with scale, and fraction of empty cells. 

Key words: galaxies: statistics, distances and redshifts - large-scale structure of Universe - 
methods: statistical - surveys 
1 INTRODUCTION

The question of whether galaxies trace the matter distribution of the 
universe has many implications for cosmology and galaxy forma- 
tion theories. Since Hubble & Humason (1931) it has been known 
that galaxies of different type cluster differently, and as such it 
cannot be possible for all galaxies to trace the matter distribution 
exactly. This observation has been reconfirmed many times, tra- 
ditionally by comparisons of the correlation functions of different 
subgroups. For example, early type (or passive) galaxies are more 
strongly clustered than late type (or actively star-forming) galaxies 
(e.g. Davis & Geller 1976; Dressler 1980; Lahav, Nemiroff & Piran
1990; Hermit et al. 1996; Norberg et al. 2002; Zehavi et al. 2002; 
Madgwick et al. 2003) and luminous galaxies cluster more strongly 
than faint galaxies (e.g. Willmer, da Costa & Pellegrini 1998; Nor- 
berg et al. 2001, 2002; Zehavi et al. 2002, 2004). 

Any difference in the distribution of galaxies relative to mass 
has become known as galaxy bias. This assumed a central importance
in cosmology via the attempts to rescue the $\Omega_m = 1$ universe
after observations of cluster mass-to-light ratios suggested values
closer to $\Omega_m = 0.2$. Such bias could occur if the galaxy formation
efficiency were increased in overdense regions of space, the
so called 'high peak scenario' (Davis et al. 1985; Bardeen et al. 
1986). Although these efforts ultimately proved fruitless, under- 
standing of bias remains important. In recent years much effort has 
been put into investigating galaxy bias through theory and numer- 
ical modelling, while observational results have been restricted by 
small survey volumes. With the advent of large galaxy redshift sur- 
veys such as the 2dF Galaxy Redshift Survey (2dFGRS: Colless et 
al. 2001, 2003), the Sloan Digital Sky Survey (SDSS: Strauss et 
al. 2002) and the Deep Extragalactic Evolutionary Probe (DEEP: 
Davis et al. 2002) it is becoming possible to quantify the galaxy 
distribution as never before, and provide detailed descriptions with 
which to compare theoretical and numerical models. 

In principle the form of bias should be derivable from the fun- 
damental physical processes involved in galaxy formation; until 
we understand these, bias remains a description of our ignorance. 
The simplest model of galaxy biasing is the linear biasing model: 
$\delta_g(\mathbf{x}) = b\,\delta_m(\mathbf{x})$, where $\delta_g$ is the galaxy overdensity perturbation,
$\delta_m$ the mass overdensity perturbation and $b$ a constant bias parameter.
This model is unphysical for $b > 1$, as it allows negative
densities. Alternative models in the literature fall into several ba- 
sic classes: linear or non-linear, local or non-local, deterministic 
or stochastic. Locally biased galaxy formation (e.g. Coles 1993; 
Scherrer & Weinberg 1998; Fry & Gaztanaga 1993) depends only 
on the properties of the local environment, and the galaxy density 
is assumed to be a universal function of the matter density: 

$\delta_g = f(\delta_m)$. (1)

Because galaxies are discrete objects, this prescription is normally 
supplemented by the Poisson Clustering Hypothesis, in which 
galaxies are modelled as random events, whose expectation number 
density is specified via $\delta_g$. This model for discreteness can only be
an approximation, but there is no simple alternative. We therefore 
assume Poisson sampling in what follows; for consistency, theoret- 
ical predictions are treated in the same way as the real data. 

Non-local models (e.g. Bower et al. 1993; Matsubara 1999) 
arise when the efficiency of galaxy formation is modulated over 
scales larger than those over which the matter moves, for exam- 
ple by effects of quasar radiation on star formation. Stochastic 
bias (Pen 1998, Dekel & Lahav 1999, hereafter DL99) allows a 
range of values of $\delta_g$ for a given $\delta_m$, above the Poissonian scatter
caused by galaxy discreteness. Stochasticity is a natural part of
non-local models, but some stochasticity is always expected to arise 
from physical processes of galaxy formation (Blanton et al. 1999). 
Throughout this paper we follow the general framework for non- 
linear stochastic biasing of DL99, in which the overdensity of one 
field can be related to that of a second field contained in the same 
volume of space through 

$\delta_1 = b(\delta_2)\,\delta_2 + \epsilon$. (2)

The scatter (stochasticity) in the relation is given by

$\epsilon = \delta_1 - \langle \delta_1 | \delta_2 \rangle$. (3)

In principle the bias parameter $b$ can be any function of $\delta_2$; a constant
value of $b$ and $\epsilon = 0$ represents deterministic linear biasing.

Galaxy bias is clearly of astrophysical interest in relation to 
an understanding of galaxy formation and evolution. Bias is also 
a major practical source of uncertainty in deriving cosmological 
constraints from galaxy surveys. Some particular examples are the 
measurement of $\beta = \Omega_m^{0.6}/b$ (Peacock et al. 2001; Hawkins et al.
2003), where DL99 showed that stochastic effects could explain 
large discrepancies between results from different methods (for a 
review see Dekel & Ostriker 1999, Table 7.2). Power spectrum 
measurements require constant bias as a fundamental assumption 
(Percival et al. 2001), and constraints placed on neutrino mass also 
assume scale independent biasing (Elgarøy & Lahav 2003). Pen
(1998) calculated the effect of nonlinear stochastic bias on the measurement
of the galaxy power spectrum on large scales, showing
how the galaxy variance, bias and galaxy-dark matter cross corre- 
lation coefficient can be calculated from velocity distortions in the 
power spectrum. The importance of biasing has increased still fur- 
ther with the release of the WMAP first year results (Spergel et al. 
2003). In order to combine CMB and 2dFGRS data to give tighter 
constraints on cosmological parameters, a model for galaxy bias is 
required (Verde et al. 2003). 

Three independent methods have been used to investigate 
galaxy biasing in the 2dFGRS catalogue. Lahav et al. (2002) combined
pre-WMAP CMB and 2dFGRS datasets to measure the average
bias over a range $0.02 < k < 0.15\,h\,{\rm Mpc}^{-1}$, concluding that
galaxies are almost exactly unbiased on these scales. Verde et al.
(2002) found the bias parameter to be consistent with unity over
scales $0.1 < k < 0.5\,h\,{\rm Mpc}^{-1}$ through measurements of the 2dFGRS
bispectrum.

A more direct method of studying the relation between mass 
and light is to map the dark matter using gravitational lensing. This 
field has made great progress in recent years, and it has been pos- 
sible to measure not only the absolute degree of bias, but also its 
nonlinearity and stochasticity (Fischer et al. 2000; Hoekstra et al. 
2002; Fan 2003; Pen et al. 2003). For example, Hoekstra et al. 
(2002) combine the Red-Sequence Cluster Survey and VIRMOS-DESCART
survey to find an average bias $b = 0.71$ and linear correlation
coefficient of $r \approx 0.57$ on scales of $1-2\,h^{-1}$ Mpc. However,
current weak lensing measurements are dominated by nonlinear
and quasi-linear scales in the power spectrum, and it is not
yet possible to say a great deal about bias in the very large-scale lin- 
ear regime. This is of course the critical region for the interpretation 
of redshift surveys, where we want to know the relation between the 
power spectra of mass and light on > 100 Mpc scales. 

This question will be settled by future weak lensing surveys. 
In the meantime, we can address a related simpler problem: the rel- 
ative bias between subsets of galaxies. The morphological differ- 
ences between galaxies and the link to their environments has been 
discussed for many decades as a potential clue to the nature and 
evolution of galaxy clustering (e.g. Spitzer & Baade 1951; Gunn &
Gott 1972; Davis & Geller 1976; Yoshikawa et al. 2001). Modern 
galaxy redshift surveys allow us to split the galaxy population into 
a variety of subdivisions such as spectral type, colour and surface 
brightness. We can look at relative bias as a function of scale, and 
weighted by luminosity. This should yield important insights into 
the absolute degree of bias that may exist. Norberg et al. (2001) 
measured bias as a function of luminosity in the 2dFGRS, finding
a bias relative to $L^*$ galaxies of $b/b^* = 0.85 + 0.15\,L/L^*$.
We concentrate on the natural bimodality of the galaxy population, 
between red early types with little active star formation, and the 
blue late type population (e.g. Baldry et al. 2003). Lahav & Saslaw 
(1992) measured bias as a function of morphological type and scale 
using the UGC, ESO and IRAS catalogues. The Las Campanas 
Redshift Survey has already provided some observational evidence 
against the linear deterministic model from splitting galaxies by 
their spectral types (Tegmark & Bromley 1999; Blanton 2000), and 
we present here a more extensive analysis of this type. 

There are several complementary methods for the measure- 
ment of galaxy clustering, although most previous studies of the 
relative bias between galaxy types have concentrated on a rela- 
tive bias parameter defined as the square root of the ratio of the 
correlation functions for the types under study. Madgwick et al.
(2003) used this method to measure the relative bias in the 2dFGRS,
finding b(passive/active) ranging from about 2.5 to 1.2 on
scales $0.2\,h^{-1}$ Mpc $< r < 20\,h^{-1}$ Mpc. However, even within
such a large survey as the 2dFGRS the correlation functions become
noisy beyond about $10\,h^{-1}$ Mpc. A second method is counts-in-cells,
which can be directly related to the correlation function
(Peebles 1980), and is optimised for the study of larger scales. It
is this latter method that we employ in this paper. Conway et al.
(2004) have also investigated the relative bias of different galaxy
types using a counts-in-cells analysis of the 2dFGRS, but they use 
magnitude limited samples, and consider only deterministic bias 
models, whereas our present analyses use volume limited samples, 
and consider stochastic bias models. The counts-in-cells method 
has also been used to calculate the variance and higher order mo- 
ments of galaxy clustering in the 2dFGRS (Conway et al. 2004; 
Croton et al. 2004a,b; Baugh et al. 2004). 

Many theoretical results on biasing from numerical models 
have also been reported. There are two main approaches to mod- 
elling galaxy distributions: semianalytic (e.g. Kauffmann, Nusser 
& Steinmetz 1997; Benson et al. 2000; Somerville & Primack 
1999) and hydrodynamic (e.g. Cen & Ostriker 1992; Blanton et al. 
1999; Cen & Ostriker 2000; Yoshikawa et al. 2001). Comparisons 
are given by Helly et al. (2003) and Yoshida et al. (2002). Several 
studies have been made of galaxy biasing in these numerical sim- 
ulations (e.g. Somerville et al. 2001; Yoshikawa et al. 2001), but 
none provide results in sufficient detail to allow an easy comparison 
with the 2dFGRS. We therefore analyse a large new semianalytic 
calculation which is capable of yielding mock results that can be 
analysed in an identical manner to the real data. 

In this paper we concentrate on a few aspects of relative bias 
mentioned above, splitting galaxies by spectral type and colour. We 
investigate the nonlinearity, stochasticity and scale dependence of 
the biasing relation through comparison with three models. Section 
2 summarises the DL99 framework for biasing, and presents the 
bias models used in this paper. Section 3 describes the 2dFGRS
catalogue and the derivation of the galaxies' spectral types and colours,
and Section 4 explains the counts-in-cells method. In Section 5 we
show the methods used for model fitting and error estimation. Section
6 gives the results and we compare our results with simulations
in Section 7.

Throughout, we adopt a cosmological geometry with $\Omega_m = 0.3$,
$\Omega_\Lambda = 0.7$ in order to convert redshifts and angles into
three-dimensional comoving distances. We define our cells with $h = 0.7$,
and all cell lengths are quoted in Mpc, instead of the standard
$h^{-1}$ Mpc.



2 MODELLING RELATIVE BIAS 

The simplest model for any bias (i.e. mass-galaxy, early-late, red- 
blue etc.) is that of linear deterministic bias: given a number of 
one type of object you can predict precisely (within Poisson errors) 
the number of the other type of object in the same region of space, 
and the relationship between the two numbers is linear. Recalling 
the familiar relation for the mass/galaxy distributions $\delta_g = b\,\delta_m$,
we can write $\delta_L = b\,\delta_E$, where $\delta_E$ ($\delta_L$) denotes the overdensity of
early (late) type galaxies in a volume of space. As described above,
this empirical model can become unphysical in low density regions. 
Considering the complex processes involved in galaxy formation, 
it would be surprising to find linear deterministic biasing to be true 
in all cases. Any reasonable physical theory in fact predicts non- 
trivial mass/galaxy biasing (Cole & Kaiser 1989) and simulations 
can also find biasing to be a complicated issue particularly on small 
scales (Cen & Ostriker 1992; Blanton et al. 1999; Somerville et al. 
2001). 

We investigate two potential improvements to the linear de- 
terministic model. Firstly the bias could be nonlinear, and some 
nonlinearity is inevitable in order to 'fix' the unphysical proper- 
ties of the linear model. Secondly, there may exist stochasticity 
(scatter beyond Poissonian discreteness noise), due to astrophysical 
processes involved in galaxy formation. DL99 presented a general 
framework to quantify these different aspects of biasing, and the 
following Section summarises their results. 



2.1 A framework for nonlinear, stochastic bias 

We use the notation $f(\delta_E)$ and $f(\delta_L)$ to denote the one-point probability
distribution functions (PDFs) for the fractional density fluctuations
of early and late type galaxies. The fields $\delta_E$ and $\delta_L$ have
zero mean by definition, and their variances are defined by

$\sigma_E^2 \equiv \langle \delta_E^2 \rangle, \quad \sigma_L^2 \equiv \langle \delta_L^2 \rangle$. (4)

The joint underlying probability distribution of early and late type 
galaxies is given by 

$f(\delta_E, \delta_L) = f(\delta_E)\, f(\delta_L | \delta_E)$ (5)
$= f(\delta_L)\, f(\delta_E | \delta_L)$. (6)

Both equations (5) and (6) should give the same outcome, but we 
have chosen to work with equation (5) to avoid unphysical linear 
biasing. 

The natural generalisation of linear biasing is given by 
$b(\delta_E)\,\delta_E \equiv \langle \delta_L | \delta_E \rangle = \int f(\delta_L | \delta_E)\, \delta_L \, d\delta_L$. (7)

There are several useful statistics that can be used to investigate 
independently the fraction of nonlinearity and stochasticity of a 
model or data. Firstly the mean biasing is defined by 
$\hat{b} \equiv \frac{\langle b(\delta_E)\,\delta_E^2 \rangle}{\sigma_E^2}$; (8)

the nonlinear equivalent of this is

$\tilde{b}^2 \equiv \frac{\langle b^2(\delta_E)\,\delta_E^2 \rangle}{\sigma_E^2}$. (9)

In each case the denominator is assigned such that linear biasing
reduces to $\hat{b} = \tilde{b} = b$. The random biasing field is defined as

$\epsilon \equiv \delta_L - \langle \delta_L | \delta_E \rangle$ (10)

and the statistical character of the biasing relation can be described
via its variance, the biasing scatter function

$\sigma_b^2(\delta_E) \equiv \frac{\langle \epsilon^2 | \delta_E \rangle}{\sigma_E^2}$. (11)

The average biasing scatter is then given by

$\sigma_b^2 \equiv \frac{\langle \epsilon^2 \rangle}{\sigma_E^2}$. (12)

The purpose of this parameterisation is to separate naturally
the effects of nonlinearity and stochasticity, allowing them to be
quantified via the relations

nonlinearity $\equiv \tilde{b}/\hat{b}$ (13)
stochasticity $\equiv \sigma_b/\hat{b}$. (14)

There are two further useful relations that are often quoted in
the literature as measures of the bias parameter and stochasticity:
the ratio of variances

$b_{\rm var}^2 \equiv \frac{\sigma_L^2}{\sigma_E^2}$ (15)

and the linear correlation coefficient

$r_{\rm lin} \equiv \frac{\langle \delta_E \delta_L \rangle}{\sigma_E\, \sigma_L}$. (16)

Both $b_{\rm var}$ and $r_{\rm lin}$ can be written in terms of the basic parameters
given above, and both mix nonlinear and stochastic effects. Non- 
parametric correlation coefficients can also be calculated directly 
from the data, although some method must be employed to account 
for shot noise. We refer the interested reader to DL99 for further 
details on the equations in this Section. 

Note that we work throughout with redshift-space overdensi- 
ties. Redshift-space distortions are dependent on galaxy type due 
to the different clustering properties of early and late type galaxies. 
On nonlinear scales the dominant effect is the finger-of-god stretch- 
ing, but on the scales of interest to this paper we expect the linear 
$\beta$-effect to apply (Kaiser 1987). Averaging over all angles and including
stochasticity between the galaxy and matter fields, we can
write the redshift-space power spectrum, $P_s$, as

$\frac{P_s}{P_r} = \left( 1 + \frac{2}{3}\, r \beta + \frac{1}{5}\, \beta^2 \right)$ (17)

where $\beta = \Omega_m^{0.6}/b$, $b$ is the mass-galaxy bias, $r$ is the linear mass-galaxy
correlation coefficient and $P_r$ is the real-space power spectrum
(Pen 1998; Dekel & Lahav 1999). The $\beta$-effect was measured
for galaxies of different spectral class in the 2dFGRS by Madgwick
et al. (2003), obtaining $\beta_L = 0.49 \pm 0.13$ and $\beta_E = 0.48 \pm 0.14$.
From these results and assuming $r = 1$ we obtain

$\frac{P_s^E / P_r^E}{P_s^L / P_r^L} \approx 0.99$. (18)



Although this suggests the effect is not large and currently insignif- 
icant within the errors, it is clear that in the case of zero stochastic- 
ity, redshift-space distortions will work to reduce the difference in 
the measured clustering between types. However, including a value 
of r which is non-unity and dependent on galaxy type as suggested 
by simulations (e.g. Blanton et al. 1999), has a significant effect. 
For example, taking $r_L = 0.8$ and $r_E = 1.0$ increases the relative
distortion from 0.99 to 1.04, where $r_L$ ($r_E$) is the linear correlation
coefficient between the mass and late (early) type galaxy fields.
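
The two distortion ratios quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes the angle-averaged form $P_s/P_r = 1 + (2/3) r\beta + \beta^2/5$ for each galaxy type and takes the ratio of the early type boost to the late type boost; the function name `kaiser_boost` is illustrative only and does not come from the paper.

```python
def kaiser_boost(beta, r=1.0):
    """Angle-averaged ratio P_s / P_r = 1 + (2/3) r beta + beta^2 / 5."""
    return 1.0 + (2.0 / 3.0) * r * beta + beta ** 2 / 5.0


# Central values quoted in the text from Madgwick et al. (2003).
BETA_LATE = 0.49
BETA_EARLY = 0.48

# Deterministic case: r = 1 for both galaxy types.
ratio_deterministic = kaiser_boost(BETA_EARLY) / kaiser_boost(BETA_LATE)

# Stochastic case from the text: r_L = 0.8 for late types, r_E = 1.0 for early types.
ratio_stochastic = kaiser_boost(BETA_EARLY, r=1.0) / kaiser_boost(BETA_LATE, r=0.8)

print(round(ratio_deterministic, 2), round(ratio_stochastic, 2))
```

Only the central values of $\beta$ are used here; propagating the quoted uncertainties would need a separate error calculation.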

2.2 One point probability distribution function 

Given equation (5) we can split the model into two parts, firstly 
the distribution of early type galaxies per cell, and secondly the bi- 
asing relation connecting the two distributions (see the following 
Section). A standard description for the underlying probability distribution
of a galaxy overdensity, $f(1 + \delta)$, is lognormal (Coles &
Jones 1991). Applying this for example to the early type galaxies:

$f(\delta_E)\, d\delta_E = \frac{1}{\omega_E \sqrt{2\pi}} \exp\left[ -\frac{x^2}{2\omega_E^2} \right] dx$ (19)

where

$x = \ln(1 + \delta_E) + \frac{\omega_E^2}{2}$ (20)

and $\omega_E^2$ is the variance of the corresponding normal distribution
$f[\ln(1 + \delta)]$:

$\omega_E^2 = \langle [\ln(1 + \delta_E)]^2 \rangle$. (21)

The offset $\omega_E^2/2$ is required to impose $\langle \delta_E \rangle = 0$. If the lognormal
distribution correctly describes the data, the variance of the overdensities,
$\sigma_E^2$, is related to the variance of the underlying Gaussian
distribution by

$\sigma_E^2 \equiv \langle \delta_E^2 \rangle = \exp[\omega_E^2] - 1$. (22)



In Section 6.5 we show that fitting a lognormal distribution directly 
to the data does not yield quite the same values for $\sigma_E$ and $\sigma_L$ as a
direct variance estimate, but this does not affect our final results for 
stochasticity. On the largest scales, a lognormal distribution is com- 
pletely consistent with the 2dFGRS data, and it provides a transpar- 
ent and simple way to describe the density field. 
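
As a quick numerical check of the lognormal properties used in this Section, the following sketch integrates over $y = \ln(1+\delta)$, taken to be Gaussian with mean $-\omega^2/2$ and variance $\omega^2$, and returns the first two moments of $\delta$. The function name, the trapezoidal integration scheme and the example width $\omega = 0.8$ are assumptions made for illustration, not part of the paper's pipeline.

```python
import math


def lognormal_moments(omega, n_steps=20001, n_sigma=10.0):
    """Return (<delta>, <delta^2>) when ln(1 + delta) ~ N(-omega^2 / 2, omega^2).

    Uses the trapezoidal rule in y = ln(1 + delta) over +/- n_sigma standard deviations.
    """
    mean_y = -0.5 * omega ** 2
    lo = mean_y - n_sigma * omega
    step = 2.0 * n_sigma * omega / (n_steps - 1)
    norm = 1.0 / (omega * math.sqrt(2.0 * math.pi))
    first = 0.0
    second = 0.0
    for i in range(n_steps):
        y = lo + i * step
        weight = 0.5 if i in (0, n_steps - 1) else 1.0
        pdf = norm * math.exp(-0.5 * ((y - mean_y) / omega) ** 2)
        delta = math.exp(y) - 1.0
        first += weight * pdf * delta * step
        second += weight * pdf * delta ** 2 * step
    return first, second


mean_delta, var_delta = lognormal_moments(0.8)
# The offset -omega^2 / 2 should give <delta> = 0 and <delta^2> = exp(omega^2) - 1.
print(mean_delta, var_delta, math.exp(0.8 ** 2) - 1.0)
```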

2.3 Biasing models 

2.3.1 Deterministic bias: linear and power law 

Firstly concentrating on deterministic bias, we can write the joint 
probability distribution function as 

$f(\delta_L | \delta_E) = \delta^D(\delta_L - b(\delta_E)\,\delta_E)$ (23)

where $\delta^D$ is the Dirac delta function. This reduces directly to linear
bias by setting

$b(\delta_E)\,\delta_E = b_{0,\mathcal{L}} + b_{1,\mathcal{L}}\,\delta_E$ (24)

where the constraint $\langle \delta_L \rangle = 0$ fixes $b_{0,\mathcal{L}} = 0$. A simple variation
could be power law bias

$b(\delta_E)\,\delta_E = b_{0,\mathcal{P}}\, (1 + \delta_E)^{b_{1,\mathcal{P}}} - 1$ (25)

which avoids the negative density predictions of linear bias, and reduces
to the linear biasing relation near $\delta = 0$. Rearranging equation
(25), using the properties of lognormal distributions and the
fact that $b_{1,\mathcal{P}} = \omega_L/\omega_E$, we find for power law bias that

$b_{0,\mathcal{P}} = \exp[0.5\,\omega_E^2\, (b_{1,\mathcal{P}} - b_{1,\mathcal{P}}^2)]$. (26)

For convenience we define

$b_{\rm lin} \equiv b_{1,\mathcal{L}}$ (27)

and

$b_{\rm pow} \equiv b_{1,\mathcal{P}}$ (28)

throughout the rest of this work.
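
The normalisation of the power law model can be checked numerically. The sketch below assumes a lognormal early type field with $\ln(1+\delta_E) \sim N(-\omega_E^2/2, \omega_E^2)$, applies $\delta_L = b_0 (1+\delta_E)^{b_1} - 1$ with $b_0 = \exp[0.5\,\omega_E^2 (b_1 - b_1^2)]$, and confirms that $\langle \delta_L \rangle = 0$ and that the late type variance is $\exp[(b_1 \omega_E)^2] - 1$, as expected if $\omega_L = b_1 \omega_E$. The function names and the example values $\omega_E = 0.9$, $b_1 = 0.7$ are illustrative assumptions.

```python
import math


def gaussian_expectation(func, mean, sigma, n_steps=20001, n_sigma=12.0):
    """Trapezoidal estimate of <func(y)> for y ~ N(mean, sigma^2)."""
    lo = mean - n_sigma * sigma
    step = 2.0 * n_sigma * sigma / (n_steps - 1)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n_steps):
        y = lo + i * step
        weight = 0.5 if i in (0, n_steps - 1) else 1.0
        total += weight * norm * math.exp(-0.5 * ((y - mean) / sigma) ** 2) * func(y) * step
    return total


def power_law_b0(omega_e, b1):
    """Normalisation b_0 of the power law bias model."""
    return math.exp(0.5 * omega_e ** 2 * (b1 - b1 ** 2))


def late_type_moments(omega_e, b1):
    """Return (<delta_L>, <delta_L^2>) under deterministic power law bias."""
    b0 = power_law_b0(omega_e, b1)

    def delta_late(y):
        # y = ln(1 + delta_E), so (1 + delta_E)^b1 = exp(b1 * y).
        return b0 * math.exp(b1 * y) - 1.0

    mean_y = -0.5 * omega_e ** 2
    first = gaussian_expectation(delta_late, mean_y, omega_e)
    second = gaussian_expectation(lambda y: delta_late(y) ** 2, mean_y, omega_e)
    return first, second


mean_late, var_late = late_type_moments(omega_e=0.9, b1=0.7)
print(mean_late, var_late, math.exp((0.7 * 0.9) ** 2) - 1.0)
```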

2.3.2 Stochastic bias: bivariate lognormal 

Returning to equation (7), we can introduce a broader function than 
the Dirac delta function of equation (23). An interesting class of 
model is when both $\delta$ fields form a bivariate Gaussian distribution,
but this again becomes unphysical for $\delta < -1$ (DL99). It is however
simple to cure this defect by assuming instead a bivariate lognormal
distribution, for which the joint probability distribution is
given by

$f(g_E, g_L) = \frac{|V|^{-1/2}}{2\pi} \exp\left[ -\frac{\tilde{g}_E^2 + \tilde{g}_L^2 - 2 r_{\rm LN}\, \tilde{g}_E\, \tilde{g}_L}{2(1 - r_{\rm LN}^2)} \right]$, (29)

where $g_i = \ln(1 + \delta_i) - \langle \ln(1 + \delta_i) \rangle$ and $\tilde{g}_i = g_i/\omega_i$, with $i$
corresponding to early or late type. $\omega_i$ is related to the variance
of the underlying Gaussian field $\ln(1 + \delta_i)$ as for the one-point
lognormal distribution:

$\sigma_i^2 \equiv \langle \delta_i^2 \rangle = \exp[\omega_i^2] - 1$. (30)

The correlation coefficient is

$r_{\rm LN} = \frac{\omega_{EL}^2}{\omega_E\, \omega_L}$ (31)

and $V$ is the covariance matrix

$V = \begin{pmatrix} \omega_E^2 & \omega_{EL}^2 \\ \omega_{EL}^2 & \omega_L^2 \end{pmatrix}$. (32)



Taking $f[\ln(1 + \delta_E)]$ to be a Gaussian of width $\omega_E$ and mean
$-\omega_E^2/2$ (i.e. $f(\delta_E)$ is distributed as a lognormal, equation [19]), the
conditional probability distribution is

$f(g_L | g_E) = \frac{f(g_E, g_L)}{f(g_E)} = \frac{\omega_E}{(2\pi |V|)^{1/2}} \exp\left[ -\frac{(\tilde{g}_L - r_{\rm LN}\, \tilde{g}_E)^2}{2(1 - r_{\rm LN}^2)} \right]$, (33)

i.e. the distribution of $\tilde{g}_L | \tilde{g}_E$ is a Gaussian with mean $r_{\rm LN}\, \tilde{g}_E$ and
variance $1 - r_{\rm LN}^2$.

As $r_{\rm LN} \to 1$, equation (33) reduces to a Dirac delta function,
and this bivariate lognormal model reduces to the power law bias
model of equation (25). It is important to note that $r_{\rm LN}$ is not equal
to the linear correlation coefficient $r_{\rm lin}$ of equation (16), which can
differ from unity even if $r_{\rm LN} = 1$. In this sense, the lognormal
parameters offer a cleaner separation of stochastic and nonlinear 
effects. If stochasticity is present within the data, this model may 
provide an improvement over the deterministic biasing models. As 
observational data improve, it may become possible to constrain 
the relative biasing function to a greater extent; the current data are 
insufficient for such an analysis. 

Analytic solutions exist to the mean biasing parameters and 
biasing scatter function given in Section 2.1 for this bivariate log- 
normal model. These relations are presented in Appendix A. 
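
To illustrate why the log-space correlation coefficient and the linear correlation coefficient differ, the sketch below uses the standard moment of a bivariate lognormal with zero-mean overdensities, $\langle \delta_E \delta_L \rangle = \exp(r_{\rm LN}\, \omega_E\, \omega_L) - 1$. This is a textbook lognormal identity and is not copied from Appendix A; the function name and the example widths are assumptions.

```python
import math


def linear_correlation(r_ln, omega_e, omega_l):
    """Linear correlation coefficient of delta_E and delta_L for a bivariate lognormal.

    Assumes <delta_E> = <delta_L> = 0, so that
    <delta_E delta_L> = exp(r_ln * omega_e * omega_l) - 1.
    """
    covariance = math.exp(r_ln * omega_e * omega_l) - 1.0
    var_e = math.exp(omega_e ** 2) - 1.0
    var_l = math.exp(omega_l ** 2) - 1.0
    return covariance / math.sqrt(var_e * var_l)


# Equal widths: a perfect log-space correlation is also a perfect linear one.
print(linear_correlation(1.0, 0.8, 0.8))
# Unequal widths (power law bias with b_pow = 0.6): the linear coefficient drops below 1
# even though the relation is fully deterministic.
print(linear_correlation(1.0, 1.0, 0.6))
```

In the second case the departure from unity is caused entirely by nonlinearity, which is the sense in which the lognormal parameters separate the two effects more cleanly.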



2.4 Including observational shot noise 

It is not possible to measure the underlying probability distribution 
$f(\delta_E, \delta_L)$ directly due to contamination of the observational data
with noise, the dominant form of which is expected to be Poisson
or 'shot' noise. This can be included in the models of the previous
Section by convolution with a Poisson distribution (Coles & Jones
1991; Blanton 2000). In this way the measured probability of finding
$N_E$ early type galaxies and $N_L$ late type galaxies within a cell,
$P(N_E, N_L)$, can be compared with the models above. Accounting
for shot noise in this way results in the models being less sensitive
to outliers than equations (23) and (33).

Using equation (5) to combine the one point PDF (19) with
the conditional PDF (23 or 33) provides a model for the actual
joint probability distribution function $f(\delta_E, \delta_L)$. Convolution with
a Poisson distribution then gives

$P(N_E, N_L) = \int_{-1}^{\infty} \int_{-1}^{\infty} \frac{[\bar{N}_E (1 + \delta_E)]^{N_E}}{N_E!}\, e^{-\bar{N}_E (1 + \delta_E)}\, \frac{[\bar{N}_L (1 + \delta_L)]^{N_L}}{N_L!}\, e^{-\bar{N}_L (1 + \delta_L)}\, f(\delta_E)\, f(\delta_L | \delta_E)\, d\delta_E\, d\delta_L$ (34)

where $\bar{N}_E$ ($\bar{N}_L$) is the expected number of early (late) type galaxies
in a given cell, allowing for completeness.
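
The Poisson convolution is easiest to see in its one-point form. The sketch below is a simplified, one-dimensional analogue of the joint integral: it computes $P(N)$ for a single galaxy type by integrating a Poisson distribution with mean $\bar{N}(1+\delta)$ over a lognormal $f(\delta)$. The function names, the trapezoidal integration and the example values $\bar{N} = 5$, $\omega = 0.6$ are assumptions for illustration; the fitting code used in the paper is not reproduced here.

```python
import math


def poisson_pmf(n, lam):
    """Poisson probability of n events for expectation lam, evaluated in log space."""
    if lam <= 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def poisson_lognormal_pmf(n, mean_count, omega, n_steps=2001, n_sigma=8.0):
    """P(N = n) for Poisson sampling of a lognormal density field.

    Integrates over y = ln(1 + delta) ~ N(-omega^2 / 2, omega^2) with the trapezoidal rule.
    """
    mean_y = -0.5 * omega ** 2
    lo = mean_y - n_sigma * omega
    step = 2.0 * n_sigma * omega / (n_steps - 1)
    norm = 1.0 / (omega * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n_steps):
        y = lo + i * step
        weight = 0.5 if i in (0, n_steps - 1) else 1.0
        gauss = norm * math.exp(-0.5 * ((y - mean_y) / omega) ** 2)
        total += weight * gauss * poisson_pmf(n, mean_count * math.exp(y)) * step
    return total


# Fraction of empty cells for an illustrative cell with 5 expected galaxies.
print(poisson_lognormal_pmf(0, mean_count=5.0, omega=0.6))
```

The count variance of this distribution should equal the shot noise term plus the clustering term, $\bar{N} + \bar{N}^2 \sigma^2$ with $\sigma^2 = \exp[\omega^2] - 1$, which gives a convenient sanity check on any implementation.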



3 THE DATA: THE 2dFGRS 

The 2dF Galaxy Redshift Survey (2dFGRS), carried out between 
May 1997 and April 2002, has obtained 221,414 good quality 
galaxy spectra using the multi-object spectrograph 2dF on the 
Anglo-Australian Telescope. The main survey area comprises two 
rectangular strips of sky with boundaries ($09^{\rm h}50^{\rm m} < \alpha < 14^{\rm h}50^{\rm m}$,
$-7.5^\circ < \delta < +2.5^\circ$) for the NGP and ($21^{\rm h}40^{\rm m} < \alpha < 03^{\rm h}40^{\rm m}$,
$-37.5^\circ < \delta < -22.5^\circ$) for the SGP, with a galaxy
median redshift of $z = 0.11$. At the median redshift, the physical
size of the survey strips is $375\,h^{-1}$ Mpc long and the SGP and
NGP regions have widths of $75\,h^{-1}$ Mpc and $37.5\,h^{-1}$ Mpc respectively
(Peacock 2003). The input galaxies were selected from a
revised and extended version of the APM galaxy catalogue (Maddox
et al. 1990), and have a limiting extinction corrected magnitude
of $b_J = 19.45$. Further details of the 2dFGRS can be found
in Colless et al. (2001), Colless et al. (2003) and on the web at
http://msowww.anu.edu.au/2dFGRS/.

For any structure analysis it is important to be aware of several 
problems that cause varying completeness over the survey region. 
Common to all similar surveys, some regions of the sky must be 
masked due to bright stars causing internal holes. Furthermore, due 
to the adaptive tiling algorithm employed to ensure an optimal ob- 
serving strategy, the sampling fraction falls to as little as 50% near 
the survey boundaries and internal holes due to lack of tile overlap. 
Subsequent reanalysis of the photometry of the APM galaxy cata- 
logue has shown the survey depth to vary slightly with position on 
the sky. To account for this, we use a limiting corrected magnitude 
of $b_J = 19.2$.



3.1 The volume limited galaxy sample 

It is well known that the luminosity of a galaxy is correlated with 
galaxy type. Therefore in a flux limited sample the fraction of 
early/late types varies with redshift, potentially complicating the 
analysis. Within any redshift survey the number density of objects
drops substantially as we reach beyond the $L^*$ galaxy luminosity.
The size of the 2dFGRS presents the option of studying volume
limited galaxy samples rather than the more usual flux limited
datasets of previous galaxy redshift surveys. By imposing a luminosity
and redshift cut, volume limited samples contain a representative
sample of most galaxies over a large redshift range. Although
some faint galaxies at low redshift are lost from the analysis, the
sample selection effects are greatly simplified.

Figure 1. The distributions of spectral type, $\eta$, and rest frame colour for all 2dFGRS galaxies. The distinction between passive and actively
star-forming galaxies is clear in both distributions. Cuts at $\eta = -1.4$ and $(B - R)_0 = 1.07$ produce the four subgroups with which
we work.

We use the publicly released data of June 2003, containing a 
total of 221,414 unique galaxies with reliable redshifts, 192,979 of 
which have spectral classification. An absolute magnitude limit of 
$M_{b_J} - 5\log_{10}(h) \le -19.0$ gives a representative sample of the
local population, balancing the number of cells against the number
of galaxies in each cell. The absolute magnitude is given by
$M_{b_J} = m - DM - K(z)$, where $m$ is the apparent extinction corrected
magnitude, $DM$ is the distance modulus, and $K(z)$ the K-correction.
Setting a limiting survey magnitude of $m = 19.2$ allows
for the varying survey depth with position on the sky (Colless et al.
2001); the K-corrections as a function of $\eta$ type are given in Madgwick
et al. (2002). This gives a maximum redshift for our sample
set by Type 1 galaxies of $z_{\rm max} = 0.114$, and 48,066 galaxies in
total. 46,912 of these have a spectral classification and 48,040 have 
measured colours. To account for the selection effects within the 
survey we use the publicly available redshift completeness masks. 
These are sufficient for galaxies classified by colour, but the spec- 
tral type analysis introduces extra selection effects. A difference in 
completeness over a region of sky could occur, for example, when 
the spectra of a survey plate are of too poor quality to perform the 
spectral type analysis, yet redshifts can be obtained. It is necessary 
to create separate masks to include these effects using the publicly 
available software. 
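
The absolute magnitude relation used to define the volume limited sample can be sketched as follows, assuming a flat universe with $\Omega_m = 0.3$ and $\Omega_\Lambda = 0.7$ and distances in $h^{-1}$ Mpc, so that the result is $M - 5\log_{10}(h)$. The function names are illustrative, and the K-correction value of 0.3 mag in the example is a placeholder assumption, not the Madgwick et al. (2002) value.

```python
import math

HUBBLE_DISTANCE = 2997.92458  # c / H0 in h^-1 Mpc, for H0 = 100 h km/s/Mpc


def comoving_distance(z, omega_m=0.3, n_steps=200):
    """Line-of-sight comoving distance in h^-1 Mpc for a flat universe.

    Integrates 1 / E(z) with Simpson's rule; n_steps must be even.
    """
    omega_lambda = 1.0 - omega_m

    def inv_e(zp):
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_lambda)

    h = z / n_steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n_steps):
        total += (4.0 if i % 2 else 2.0) * inv_e(i * h)
    return HUBBLE_DISTANCE * total * h / 3.0


def distance_modulus(z, omega_m=0.3):
    """Distance modulus DM = 5 log10(D_L / Mpc) + 25, with D_L in h^-1 Mpc."""
    d_lum = (1.0 + z) * comoving_distance(z, omega_m)
    return 5.0 * math.log10(d_lum) + 25.0


def absolute_magnitude(m, z, k_correction, omega_m=0.3):
    """M - 5 log10(h) = m - DM - K(z)."""
    return m - distance_modulus(z, omega_m) - k_correction


# Apparent magnitude limit and maximum redshift from the text; K = 0.3 is hypothetical.
print(absolute_magnitude(19.2, 0.114, k_correction=0.3))
```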



3.2 Galaxy properties in the 2dFGRS 

The 2dFGRS catalogue provides two methods of classification for
comparison: firstly the well studied galaxy spectral type $\eta$, and
secondly the photometric colours of the galaxies, which have recently been
derived.



3.2.1 Spectral type, $\eta$

The spectral type of the galaxies has been derived by a Principal
Component Analysis (PCA), which identifies the most variable aspects
of the galaxy spectra with no prior assumptions or template
spectra (Folkes et al. 1999; Madgwick et al. 2002). The spectral
type of the 2dFGRS galaxies is characterised by the value $\eta$, a linear
combination of the first two principal components, derived in
order to minimise the effect of distortions and imperfections in the
2dFGRS spectra. In effect, $\eta$ classifies galaxies according to the
average emission and absorption line strength in the spectrum. $\eta$
provides a continuous classification scheme, but for our purposes
it is necessary to split the galaxies into two classes, at $\eta = -1.4$
as suggested in Madgwick (2003). Galaxies with $\eta < -1.4$ (Type
1) are shown to be predominantly passive galaxies and those with
$\eta > -1.4$ (Types 2-4) predominantly star-forming (Madgwick
et al. 2003). The former are hence termed 'early type' and the latter
'late type'. The 2dFGRS catalogue contains 74,548 early type and
118,424 late type galaxies defined in this way.

Figure 2. The joint distribution of the colour and spectral types for galaxies
in the 2dFGRS. Axis label: rest frame colour, $(b_J - R_F)_0$.

One concern with using optical fibre spectra for this type of
analysis is 'aperture effects', resulting from the fixed aperture of
the fibres being smaller than the size of galaxies. This could result
in, for example, only the bulge components of close spirals being
observed. Such effects have been studied in detail by Madgwick
et al. (2002), and no systematic bias found. One possible explanation
for this is the poor seeing present at the Anglo-Australian
Telescope, of order 1.5"-1.8", which will cause the fibre to average
over a large fraction of the total galaxy light in most cases
and dilute aperture effects. An overabundance of late type galaxies
was detected at redshifts beyond 0.11, which could be attributed
to either aperture effects or evolution; however, this will not affect
our volume limited galaxy sample with a maximum redshift
of $z_{\rm max} = 0.114$. Aperture effects are discussed further in Section
6.2.4.

Table 1. Numbers of early/late and red/blue galaxies in the 2dFGRS catalogue
with good quality spectra (Q >= 3).

            red/early    blue/late    Total
colour      77,120       144,292      221,414
$\eta$      74,548       118,424      192,979



3.2.2 Broad-band colours 

More recently it has been possible to obtain broad-band colours for
the 2dFGRS galaxies using the same $b_J$ UKST plates as the survey
input catalogue (Hambly et al. 2001), but now scanned with
the SuperCosmos machine to yield smaller errors of about 0.09
mag per band. Similar scans have also been made of the UKST
$R_F$ plates. The extinction corrections are from the dust maps of
Schlegel, Finkbeiner & Davis (1998) and wavelength dependent
extinction ratios are from Cardelli, Clayton & Mathis (1989). We
define rest frame colour

    $(B - R)_0 = B_J - R_F - K(B_J) + K(R_F)$                      (35)

where the colour-dependent K corrections are

    $K(B_J) = (-1.63 + 4.53\,C)\,y + (-4.03 - 2.01\,C)\,y^2
              - \frac{z}{1 + (10z)^4}$,                             (36)

    $K(R_F) = (-0.08 + 1.45\,C)\,y + (-2.88 - 0.48\,C)\,y^2$,       (37)

with $y = z/(1 + z)$ and $C = B_J - R_F$. See Cross et al. (2004)
for more details. A division at $(B - R)_0 = 1.07$ achieves a similar
separation between 'passive' and 'actively star forming' galaxies
to spectral classification of Type 1 to Types 2-4, giving a total of
77,120 red galaxies and 144,292 blue galaxies.
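
The colour definition above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the code assumes $y = z/(1+z)$, and it assumes the final term of $K(B_J)$ enters with a minus sign, as read from equations (35) to (37).

```python
def k_corrections(z, colour):
    """Return (K_B, K_R) in magnitudes, following equations (36) and (37).

    z      -- galaxy redshift
    colour -- observed colour C = B_J - R_F
    """
    y = z / (1.0 + z)
    k_b = ((-1.63 + 4.53 * colour) * y
           + (-4.03 - 2.01 * colour) * y ** 2
           - z / (1.0 + (10.0 * z) ** 4))  # sign of this term is an assumption
    k_r = (-0.08 + 1.45 * colour) * y + (-2.88 - 0.48 * colour) * y ** 2
    return k_b, k_r


def rest_frame_colour(b_j, r_f, z):
    """Equation (35): (B - R)_0 = B_J - R_F - K(B_J) + K(R_F)."""
    k_b, k_r = k_corrections(z, b_j - r_f)
    return b_j - r_f - k_b + k_r


def colour_class(b_j, r_f, z, split=1.07):
    """Label a galaxy 'red' or 'blue' using the (B - R)_0 = 1.07 division."""
    return "red" if rest_frame_colour(b_j, r_f, z) > split else "blue"
```

At $z = 0$ both K corrections vanish, so the rest frame colour reduces to the observed colour; this gives a quick sanity check on any implementation.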

The distributions of $\eta$ type and colour for the 2dFGRS galaxies
are shown in Fig. 1, and the joint distribution is shown in Fig. 2.
The correlation between the two properties is clear, together with 
the distinct bimodality, yet it is obvious that the relationship is not 
exactly one-to-one. Table 1 gives the respective numbers of each 
galaxy type in the 2dFGRS catalogue for comparison. 



4 METHOD: COUNTS-IN-CELLS 

A counts-in-cells analysis is employed, which involves splitting the
survey region into a lattice of roughly cubical cells and counting
the number of galaxies in each cell. The cell dimensions are defined
such that all have equal volume $V = L^3$, but with limits to right
ascension and declination that form a square on the sky. This angular
selection simplifies the treatment of the survey mask, but it means
that the cells are not perfect cubes. Over the redshift range involved,
this effect is small. The cells are required to fit strictly within the
2dFGRS area, causing some parts of the survey to be unused. Although
this restriction in principle removes any boundary effects, it means
that cells of different sizes sample slightly different areas of the
universe. We define our cells with $h = 0.7$ and in what
follows all cell lengths are quoted in Mpc, instead of the standard
$h^{-1}$ Mpc. In practice, we considered 10 Mpc < L < 45 Mpc,
giving a total of between 11,423 and 72 cells in the volume limited
survey area after removing low completeness cells. These cell sizes
are equivalent in volume to using top hat smoothing spheres with
radii 6.1 $h^{-1}$ Mpc < r < 27.9 Mpc. Fig. 3 shows an example
of how 25 Mpc cells cover the 2dFGRS volume to z = 0.11.

Due to internal holes in the survey and the adaptive tiling 
algorithm employed, the sampling fraction in the 2dFGRS varies 
over the sky. Random 2dFGRS catalogues can be created, which 
include these selection effects by making use of the calculated sur- 
vey masks. Each cell count is weighted by the fraction of random 
points found in the same cell in the mock catalogue. The spectral 
type analysis introduces extra selection effects, which are quanti- 
fied by a special mask (see Section 3.1). 

An overdensity $\delta$ is calculated for each cell by dividing the
observed cell counts $N$ by the expected number for a given cell
allowing for completeness, $\bar{N}$:

    $\delta \equiv \frac{N}{\bar{N}} - 1$.                           (38)

This procedure is carried out for both early and late type galaxies 
within each cell. The overall density variance is defined by equation 
(4). 
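
As a minimal sketch of this step, the snippet below turns raw cell counts into overdensities. It assumes that the expected count is the mean galaxy density times the cell volume times the cell completeness, and that cells below a completeness threshold are discarded; the function and argument names are illustrative, not taken from the survey software.

```python
def overdensities(counts, completeness, mean_density, cell_volume,
                  min_completeness=0.7):
    """Return delta = N / Nbar - 1 for every cell passing the completeness cut.

    Nbar is modelled here as mean_density * cell_volume * completeness,
    which is an assumption made for this illustration.
    """
    deltas = []
    for n, c in zip(counts, completeness):
        if c < min_completeness:
            continue  # drop excessively under-sampled cells
        expected = mean_density * cell_volume * c
        deltas.append(n / expected - 1.0)
    return deltas
```

An empty cell always gives $\delta = -1$ regardless of its completeness, which is one reason empty cells need separate care in the figures that follow.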

It is necessary to set a completeness limit to remove excessively
under-sampled cells, such as those affected by holes in the
survey due to stars, or cells at the less observed edges of the survey
volume. Although the limits are somewhat arbitrary, it is important
that they are set correctly, as incomplete cells could affect our
measurements of scatter in the biasing relation. Fig. 4 shows the
completeness distributions for L = 25 Mpc cells. It can be seen that
for both $\eta$ and colour classification the distribution has a sharp
peak of almost complete cells, with a long tail to low completeness
and a sharp cut off at high completeness. The figure also highlights
the importance of including the effects of $\eta$ classification on
the 2dFGRS mask, as the completeness peak and cut off are noticeably
lower for $\eta$ classification than for all galaxies. The lower
completeness is reflected in the reduced number of cells available for
analysis.

A completeness limit is set for each cell at 70% (or 60% for the 
larger cells), to include all cells within the high completeness peak. 
In order to check the effects of completeness on the final results, 
the models were also fit to only those cells with completenesses 
higher than 80% (70% for the larger cells), and the results found to 
be consistent within the errors. 

A general concern with a counts-in-cells analysis of observa- 
tional data is the varying survey selection function over a cell's 
extent. For example, a cell containing a cluster of galaxies at its 
most distant edge, and weighted by the average selection function 
over its volume, would give a different 'count' to a cell containing a 
cluster near its inner boundary. Furthermore, with a joint counts-in- 
cells analysis any relationship between luminosity and galaxy type 
or colour will cause differing fractions of objects with redshift. 

With careful use of type-dependent selection functions such 
effects can be allowed for (Conway et al. 2004), but in our anal- 
ysis we use volume limited samples. This approach avoids these 
complications at the expense of reducing the number of galaxies in 
the analysis. The size of the 2dFGRS offers a great advantage over 
previous observational studies of relative bias, because it provides 
volume limited samples large enough to produce reliable measurements.

Figure 3. Wedge plots of the 2dFGRS volume limited survey region with $M_{b_J} - 5\log_{10}(h) \le -19.0$. Dots represent late type galaxies on the left
and early type galaxies on the right (classified by spectral type). Redshift increases from the centre, right ascension is shown on the horizontal axis, and
declination is projected onto the plane. Typical cell boundaries of length L = 25 Mpc are overplotted.



5 PARAMETER FITTING 

5.1 A maximum likelihood approach 

Once we have chosen a model, a maximum likelihood method is
used to fit the free parameters of the model to the data. Denoting
the number of early (late) type galaxies within cell $i$ as
$N_{E,i}$ ($N_{L,i}$), the likelihood of finding a cell containing
$N_E$ early type and $N_L$ late type galaxies, given a model with free
parameters $\alpha$, is defined as

    $L_i(N_{E,i}, N_{L,i} | \alpha) \equiv P(N_{E,i}, N_{L,i} | \alpha)$   (39)

and the total likelihood for all cells is then

    $L = \prod_i L_i$.                                              (40)

The likelihood can be maximised with respect to the free parameters
$\alpha$ to find the best fitting values $\hat{\alpha}$ for the model
given the dataset. In practice it is easier to minimise the function

    $\mathcal{L} \equiv -\ln L = -\sum_i \ln L_i$.                   (41)

Note that this definition of $\mathcal{L}$ differs by a factor of two
compared to Conway et al. (2004).

The models in Section 2 contain two or three free parameters:
$\sigma_E$ and/or $\sigma_L$ from the one point PDFs, and $b$ or
$r_{\rm LN}$ from the conditional probability function. These
parameters were fitted simultaneously to the data using a downhill
simplex method (Press et al. 1992).
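
To make the fitting procedure concrete, here is a deliberately reduced sketch: a one-parameter fit of a Poisson-sampled lognormal one point distribution to a list of cell counts, with the mean count held fixed. It is not the paper's code. The function names are hypothetical, the integral over the density field is done with a simple trapezoidal rule, and a golden-section search stands in for the multi-parameter downhill simplex method.

```python
import math
from collections import Counter


def poisson_lognormal_pmf(n, nbar, sigma, steps=240):
    """P(N = n) for Poisson sampling of a lognormal field with <1 + delta> = 1."""
    du = 12.0 / steps
    total = 0.0
    for k in range(steps + 1):
        u = -6.0 + k * du  # standard normal variable, integrated over [-6, 6]
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoidal rule end points
        lam = nbar * math.exp(sigma * u - 0.5 * sigma ** 2)  # Nbar * (1 + delta)
        log_poisson = n * math.log(lam) - lam - math.lgamma(n + 1)
        total += weight * math.exp(log_poisson - 0.5 * u * u)
    return total * du / math.sqrt(2.0 * math.pi)


def neg_log_likelihood(counts, nbar, sigma):
    """-sum_i ln L_i, evaluated once per distinct count value."""
    return -sum(m * math.log(max(poisson_lognormal_pmf(n, nbar, sigma), 1e-300))
                for n, m in Counter(counts).items())


def fit_sigma(counts, nbar, lo=0.05, hi=3.0, tol=1e-4):
    """Golden-section minimisation of the negative log-likelihood over sigma."""
    nll = lambda s: neg_log_likelihood(counts, nbar, s)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = nll(c), nll(d)
    while b - a > tol:
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = nll(c)
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = nll(d)
    return 0.5 * (a + b)
```

Grouping cells by their count value keeps the cost independent of the number of cells, which matters for the roughly ten thousand cells available at the smallest scale. Extending the sketch to the joint early/late distribution would require a two-dimensional integral and a multi-parameter minimiser.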



5.2 Error estimation 

As it is not possible to derive analytic solutions to the sampling
distribution of our maximum likelihood estimators $\hat{\alpha}$, the
standard error on our parameters must be estimated directly from the
likelihood function using Bayes' theorem and assuming a uniform prior,

    $P(\alpha | x) \propto L(x; \alpha)$,                            (42)

where $\alpha$ again denotes the model parameters, $x$ the data, and
$P$ the probability.

For a single free parameter, the upper and lower limits on $\alpha$
are found from

    $P(\alpha_- < \alpha < \alpha_+ | x) =
        \frac{\int_{\alpha_-}^{\alpha_+} L(x; \alpha)\, d\alpha}
             {\int_{-\infty}^{\infty} L(x; \alpha)\, d\alpha}$.      (43)

If it can be assumed that the likelihood function is reasonably
approximated by a Gaussian, $1\sigma$ errors on the parameter can be
estimated. For multi-parameter models it is necessary to quantify any
possible degeneracy between errors. If the multi-dimensional
likelihood function can be approximated by a multivariate Gaussian
distribution, individual errors and correlations between the
parameters can be found.

Figure 4. Histograms showing the completeness distribution of 25 Mpc
cells using the standard 2dFGRS mask (top), and including the effects of $\eta$
classification (bottom).

A second method of error estimation involves creating many 
mock datasets from the fitted model probability distributions them- 
selves. These datasets are made through Monte Carlo techniques, 
and designed to closely reproduce the true data in size. On apply- 
ing the above likelihood techniques to these mock datasets, the best 
fit and true parameters can be compared to estimate the errors. The 
advantage of this method is that no assumptions need to be made 
about the shape of the likelihood function. The disadvantage is that 
we are assuming that the model is a correct representation of the 
data, as the errors strictly apply only to the model not the data. 
By increasing the size of the mock datasets, this method can also 
be used to check for any bias inherent in the fitting method. This 
process was carried out for each model in this paper, finding the 
parameter estimations to be unbiased. 
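
The mock-dataset approach can be illustrated with a short simulation. The sketch below draws Poisson-sampled lognormal cell counts from an assumed best-fit model, re-estimates the lognormal width on each mock, and reports the mean and scatter of the estimates. For brevity it swaps the full maximum likelihood fit for a shot-noise-corrected moment estimator; that substitution, the function names and the default settings are all assumptions of this example.

```python
import math
import random
import statistics


def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def mock_counts(n_cells, nbar, sigma, rng):
    """Poisson-sampled lognormal cell counts with <1 + delta> = 1."""
    counts = []
    for _ in range(n_cells):
        g = rng.gauss(-0.5 * sigma ** 2, sigma)  # ln(1 + delta)
        counts.append(draw_poisson(nbar * math.exp(g), rng))
    return counts


def sigma_from_moments(counts):
    """Moment estimate: Var(N) = Nbar + Nbar^2 * (exp(sigma^2) - 1)."""
    mean = statistics.fmean(counts)
    var_delta = (statistics.pvariance(counts) - mean) / mean ** 2
    return math.sqrt(math.log(1.0 + max(var_delta, 0.0)))


def mock_error(n_cells, nbar, sigma_fit, n_mocks=200, seed=1):
    """Mean and standard deviation of the estimator over many mock datasets."""
    rng = random.Random(seed)
    estimates = [sigma_from_moments(mock_counts(n_cells, nbar, sigma_fit, rng))
                 for _ in range(n_mocks)]
    return statistics.fmean(estimates), statistics.stdev(estimates)
```

The standard deviation of the estimates plays the role of the parameter error, while any offset between their mean and the input value indicates bias in the estimator.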

In all cases, we will make the assumption that the density
fluctuations in each cell can be treated as independent. This is
clearly not true in detail, since the existence of modes with
wavelength > L will cause a correlation between nearby cells. This was
considered by Broadhurst, Taylor & Peacock (1995), who showed that
the correlation coefficient was low even for adjacent cells:
$r \simeq 0.2$. As we shall see, it is $(1 - r^2)^{1/2}$ that matters
for joint distributions (about 0.98 for $r = 0.2$), and so the failure
of independence is negligible in practice.



5.3 Model comparison 

Once we have found the best fit parameters for each of our three 
models, we would like to know the goodness-of-fit of the models 
and the significance of any differences between the fits. We ap- 
proach this using two different methods. 



5.3.1 Likelihood ratios 

To test the significance of one model against another model we use
the likelihood ratio test. In its simplest form we define the
maximum likelihood ratio for hypothesis $H_0$ versus $H_1$,

    $\Lambda \equiv \frac{L(x | H_0)}{L(x | H_1)}$,                   (44)

where $x$ is the data, and $L$ represents the maximum likelihood
value. This will be especially valuable in assessing the evidence
for stochasticity, where we will compare a model of perfect
correlation with one where $r_{\rm LN} \neq 1$ is allowed, effectively
introducing an extra parameter. The key question is how large a boost
in likelihood is expected from the introduction of an extra parameter,
and this was considered by Liddle (2004). He advocates the Bayesian
Information Criterion, defined as

    ${\rm BIC} \equiv -2 \ln L + p \ln N$,                           (45)

where $p$ is the number of parameters and $N$ is the number of
data points. This measure of information effectively says that going
from a satisfactory model with $p$ parameters to one that overfits
with $p + 1$ parameters would be expected to increase $\ln L$ by
$0.5 \ln N$. Therefore, in order to achieve evidence in favour of the
increase to $p + 1$ at the usual 5 per cent threshold, we require

    $\Delta \ln L = -\ln 0.05 + 0.5 \ln N$,                          (46)

which is between 5 and 8 for the number of cells considered here.
An unequivocal detection of stochasticity thus apparently requires
a likelihood ratio between the $r_{\rm LN} \neq 1$ and the best
$r_{\rm LN} = 1$ model in excess of $\Lambda \sim \exp(5)$ to
$\exp(8)$.
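
The threshold in equation (46) is simple to evaluate for the cell numbers quoted in this paper. The short sketch below (with hypothetical function names) computes the Bayesian Information Criterion and the required gain in log-likelihood for one extra parameter.

```python
import math


def bic(ln_like, n_params, n_points):
    """Bayesian Information Criterion: -2 ln L + p ln N."""
    return -2.0 * ln_like + n_params * math.log(n_points)


def required_delta_ln_like(n_points, threshold=0.05):
    """Gain in ln L needed to justify one extra parameter, equation (46)."""
    return -math.log(threshold) + 0.5 * math.log(n_points)


for n_cells in (72, 1104, 11423):
    print(n_cells, round(required_delta_ln_like(n_cells), 2))
```

For 72 cells the required gain is about 5.1 and for 11,423 cells about 7.7, which reproduces the quoted range of 5 to 8.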

Monte Carlo simulations may be used to check the validity of
this analytic method. This is computationally expensive, so only an
upper limit may be set on the significance of an observed
likelihood ratio. We create 40 mock datasets following a power law
bias model convolved with a Poisson distribution, defining the mean
cell counts, number of cells, one point PDF fit parameter $\sigma_E$
and model parameter $b_{\rm pow}$ to emulate a range of datasets. To
these we fit both power law and bivariate lognormal models with the
usual maximum likelihood fitting procedure. This allows us to assess
the largest likelihood ratio that should arise by chance if the true
model is in fact perfect power law bias. The results suggest that a
substantially smaller critical value is required than equation (46),
closer to $\Delta \ln L \simeq 1$ to reject the model at the 95%
confidence limit. It therefore appears that the assumptions used to
derive the Bayesian Information Criterion do not apply to this problem.



5.3.2 Kolmogorov-Smirnov test 

Although the likelihood ratio test can eliminate one model in favour
of another, it cannot tell us how well the preferred model fits the
data. A Kolmogorov-Smirnov (KS) test can be used to test for a 
difference between an observed and modelled cumulative proba- 
bility distribution. This test provides the probability that the data 
are drawn from the model probability distribution, with a low prob- 
ability representing a poor fit. A resulting probability above about 
0.1 is generally accepted as a reasonable fit, as the KS test is un- 
able to rule out the model being the true underlying distribution at 
greater than 90% confidence. Strictly the test becomes invalid once 
the data has been used to fix any free parameters of the model, as 
in this method (Lupton 1993). However, as long as the number of 
data points is much greater than the number of free parameters any
effects should be small.

Table 2. Completeness limits, total number of cells and average cell counts
for each dataset after corrections for completeness have been applied. Note
that numbers do not scale exactly as $L^3$ due to edge effects.

          Cell size (Mpc)   compl.   no. cells     average cell counts
Colour    10                0.7      11423         [illegible]   [illegible]
          15                0.7      [illegible]   [illegible]   [illegible]
          20                0.7      1104          12.1          14.0
          25                0.7      484           25.6          30.1
          30                0.7      234           41.3          48.6
          35                0.6      169           57.4          70.7
          40                0.6      115           88.3          105.4
          45                0.6      72            125.9         149.2
$\eta$    10                0.7      9668          1.9           1.9
          15                0.7      2567          6.5           6.3
          20                0.7      930           15.3          14.7
          25                0.7      404           32.1          30.2
          30                0.7      187           54.2          49.6
          35                0.6      115           71.8          71.4
          40                0.6      74            106.4         108.7


To transform our bivariate distribution to a 1D variable on
which we can perform the standard KS test, we create integrated 
probability distributions from both the model and the data, inte- 
grating within constant model probability contours centered on the 
position of maximum probability. This gives cumulative probabil- 
ity distributions for model and data from which the KS probability 
(that the data follow the same underlying distribution as the model) 
can be derived. 
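
One way to picture this reduction, for a discrete model distribution, is sketched below. Each outcome is mapped to the total model probability of all outcomes that are at least as likely, which is the probability enclosed by the model contour through that point. Under the model this variable is uniformly distributed at its attained values, so the KS distance is the largest difference between the data's cumulative fraction and the enclosed probability itself. The dictionary representation and function names are assumptions made for this example, not the paper's implementation.

```python
def enclosed_probability(pmf, point):
    """Model probability inside the contour passing through `point`."""
    level = pmf[point]
    return sum(p for p in pmf.values() if p >= level)


def contour_ks_distance(pmf, data):
    """Maximum distance between data and model CDFs of the enclosed probability.

    pmf  -- dict mapping (N_E, N_L) pairs to model probabilities
    data -- list of observed (N_E, N_L) pairs, all present in pmf
    """
    data_t = [enclosed_probability(pmf, x) for x in data]
    levels = sorted({enclosed_probability(pmf, x) for x in pmf})
    n = len(data_t)
    distance = 0.0
    for lv in levels:
        data_cdf = sum(t <= lv + 1e-12 for t in data_t) / n
        distance = max(distance, abs(data_cdf - lv))  # model CDF equals lv here
    return distance
```

The resulting distance can then be passed to a standard one-dimensional KS probability calculation.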

The KS test has been generalised to bivariate analyses by Pea- 
cock (1983) and Fasano & Franceschini (1987). However, this 2D 
KS test was found to lack power compared to the previous method 
for the present application. 
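
For reference, a one-sample, one-dimensional KS test can be written compactly. The sketch below computes the KS distance between a sample and a model cumulative distribution, and converts it to a probability using the standard asymptotic series with the small-sample correction given in Press et al. (1992). Function names are illustrative, and the treatment of very small distances (returning a probability of one) is a simplification chosen for this example.

```python
import math


def ks_statistic(sample, model_cdf):
    """Largest distance between the empirical CDF and model_cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model_cdf(x)
        # compare with the empirical CDF just below and just above x
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def ks_probability(d, n, terms=100):
    """Asymptotic probability of a KS distance at least as large as d."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the value is ~1
    total = 0.0
    for j in range(1, terms + 1):
        total += (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))
```

A sample placed exactly on the quantiles of the model returns a probability of one, while a sample concentrated in one tail returns a probability indistinguishable from zero.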



6 RESULTS 

Table 2 summarises the details of each of our samples, including 
average count per cell, and completeness limit. The following Sec- 
tions look in detail at the univariate and bivariate model fits to each 
dataset. In Section 6.2.3 we investigate the scale dependence of 
nonlinearity and stochasticity in the 2dFGRS. In Sections 6.3 and 
6.4 we discuss the origin of the stochasticity and perform some 
consistency checks on the results. 

6.1 One point probability distributions 

Fig. 5 shows the bivariate distributions of cell counts, together with 
the one point distribution functions for a range of cell sizes. Before 
we consider the bivariate distributions further, we look in detail at 
the individual lognormal fits to these one point distributions. We fit 
a lognormal distribution convolved with Poisson noise to the early 
and late number counts individually, using the method described in 
the previous Sections. The best fitting lognormal models are shown 
overplotted in the Figure. It can be seen that on large scales the 
lognormal model alone fits the data well, but on small scales the 
deviation due to discreteness is substantial. For this reason it is im- 
portant to account for shot noise in the fitting procedure. 

In order to assess the Poisson sampled lognormal model
quantitatively, we create many Monte Carlo cells with best fitting
parameters and expected number counts to match each dataset. We
allow for completeness effects by randomly assigning each cell
a completeness value from a Gaussian distribution of width and
mean equal to those of the dataset. Fig. 6 shows the distributions of
early type galaxies and Monte Carlo cells with matching parameters.
Overplotted are the best fitting lognormal model (dashed line)
and a lognormal curve with variance derived directly from the data
(dotted line, see Section 6.5).

We can now compare our MC data with our true data through
a KS test. For large cells ($\geq 25$ Mpc) we find KS probabilities
in excess of 0.8, but as cell size decreases the KS probabilities
decrease. On the smallest scales of 10 Mpc we obtain KS
probabilities of $\sim 10^{-9}$ and it is this poor model fit that
causes the lognormal model to overestimate variances in comparison
with direct methods. The Figure shows there to be an excess of data
cells with moderate overdensities compared to the best-fitting
lognormal model, particularly on the smallest scales. In very
underdense regions the Figure shows the dotted (direct variance) curve
to lie below the dashed (fitted) curve. This results in an
underprediction of the number of cells containing zero or one galaxy
when using the direct variance method (as discussed in more detail by
Conway et al. 2004).



6.1.1 Failures of the Poisson sampled lognormal distribution 

The discrepancy between the observed and predicted distributions 
of cell counts shows that at least one of our two assumptions about 
the galaxy field is incorrect. The lognormal distribution is simply 
a convenient functional form which has been shown to fit galaxy 
distributions from previous surveys (e.g. Hamilton 1985; Kofman 
et al. 1994) and N-body matter distributions successfully (Kofman 
et al. 1994; Kayo, Taruya & Suto 2001, and references therein). De- 
viations from this simple model are evident in detailed numerical 
simulations (e.g. Bernardeau & Kofman 1995), and at some level at 
least we would expect to see such deviations in our data. Various al- 
ternative distributions have been suggested in the literature such as 
the skewed lognormal, negative binomial or Edgeworth expansions 
(see also Sheth, Mo & Saslaw 1994; Valageas & Munshi 2004). 

However, the fact that the model fails in detail on scales at 
which shot noise dominates the distribution in underdense regions 
suggests that the Poisson sampling hypothesis is at least partly to 
blame. By attempting to fit the model to these underdense cells, the 
variance is increased and the moderately overdense regions are no 
longer well fitted. On small scales the majority of cells contain zero 
or one galaxy, hence the preference of the model to fit these cells 
and not those containing more galaxies. 

We hope to explore more complex models for the count distri- 
bution elsewhere. For the present application, there are two points 
to make. The first is that cells of side about 10 Mpc are the smallest 
that can sensibly be discussed with this approach; reducing the cell 
size would lead to distributions that are dominated by discreteness 
effects. More importantly, it should be stressed that analytic mod- 
els of this sort are not really physical. In the end, what matters is 
whether the 2dFGRS data match the predictions of a proper calcu- 
lation of galaxy formation. We carry out such a comparison at the 
end of the paper, and the results of a Poisson-sampled lognormal 
fit are a convenient statistic to use for this purpose. Provided true 
data and mock data are treated identically, small imprecisions in
the function used for the fit are irrelevant.

Figure 5. On the left, the bivariate counts-in-cells distributions with early and late type galaxies classified by colour. The points mark density values of
individual cells, and from top to bottom L = 15, 25 and 35 Mpc cells are shown. 1D projections of the distributions are shown for early types (centre) and
late types (right), to which a Poisson sampled lognormal model has been fitted. The best-fitting lognormal curves are overplotted (dashed line). Due to the
logarithmic axes, a bin for cells containing zero galaxies has been artificially positioned on the horizontal axis. Note the discreteness of the galaxy counts: the
actual number of galaxies contained in the cells is indicated by the numbers over the 1D distributions. Further note the survey completeness effects on smaller
counts per cell, causing the spread of points around the mean value. Correcting zero counts for completeness is non-trivial and not included in this analysis,
hence there is no spread of these points.



6.2 Joint distributions and biasing models 

Each of the three models in Section 2 is fitted to the datasets
described in Table 2. Best fit parameters for the models are estimated
simultaneously through the maximum likelihood method of Section 5.1.
Table 3 shows the best fitting parameter values for the two
deterministic models, together with log-likelihood differences
between the model and the bivariate lognormal model. The values of
$b_{\rm pow}$ and $b_{\rm lin}$ clearly show that early type galaxies
are more clustered than late type galaxies, as is well known. Table 4
gives the best fitting parameter values and errors for the stochastic
bias model. The quoted errors are determined through multivariate
Gaussian fits to the likelihood surface, which were found to agree
well with Monte Carlo error estimates.

Fig. 7 shows the joint probability distribution of the data for
L = 20 Mpc cells, together with Monte Carlo realisations of the
best fitting linear, power law and bivariate lognormal models for
comparison. The Monte Carlo realisations include completeness effects
by randomly selecting a completeness value for each cell from
a Gaussian with mean and width equal to that of the true distribution
of cell completeness. This scale is chosen to illustrate all the
properties of the data and the figures clearly show the shot noise
in underdense regions, together with the effects of survey
incompleteness. By eye we can see the differences between the linear
and power law bias models, and the effect of stochasticity in the
bivariate lognormal model. The best fitting linear model has a mean
bias closer to unity at high density, and cannot fit the nonlinearity
seen in the data. The power law model corrects for this, but the
scatter about the mean is insufficient to match the data. The
stochasticity introduced by the bivariate lognormal model is evident
and is matched well by the data. The likelihood ratios shown in
Table 3 quantify the differences and show that on all scales the
bivariate lognormal model gives significantly better fits compared to
deterministic biasing models.

Figure 6. The univariate distributions of early type galaxies for L = 10, 15, 20 and 25 Mpc cells (empty, red histogram), together with the distribution
of Monte Carlo cells (hatched, blue histogram) with parameters equal to those obtained from fitting a Poisson sampled lognormal distribution to the data
cells. Completeness effects are modelled as a Gaussian with variance equal to that found in the dataset. The dashed curve shows this best-fitting lognormal
distribution; the dotted curve shows the lognormal curve with variance equal to that measured directly from the data. Both variances are given in the upper
right, together with KS probabilities that the Monte Carlo cells are drawn from the same distribution as the data cells. Due to the logarithmic axes, a bin for
cells containing zero galaxies has been artificially placed on the horizontal axis. For L = 10 Mpc about 60% of the cells contain zero early type galaxies, and
the vertical axis has been truncated to allow a better view of the remaining bins. Note the discreteness of the data and simulations, as discussed in the caption
to Fig. 5.

Figure 7. On the top left, the bivariate counts-in-cells distributions for 20 Mpc length cells, with early and late type galaxies classified by colour. The points
mark density values of individual cells. The other three panels show Monte Carlo realisations of the best fitting linear, power law and bivariate lognormal
models. The realisations are created to match the data as far as possible, with equal cell numbers and average number counts. Cell completeness is included
by assuming the distribution of cell completeness to be a Gaussian of mean and width equal to that of the data cells. In each panel the dashed line shows the
b = 1.0 case, and the dash-dot line shows the mean biasing of each model (for the top left plot, the dash-dot line shows the mean biasing of the best fitting
bivariate lognormal model). Poisson sampling of the galaxies is assumed in all cases. Note that for all but b = 1.0, linear bias appears as a curve on the log-log
plots. Due to the logarithmic axes, cells containing zero early or late type galaxies have been artificially positioned.

We now repeat this analysis, splitting galaxies by spectral type
$\eta$, rather than by colour. The colour split allows a larger sample
of cells to be included in the analysis, but Fig. 2 shows that a
division by spectral type does not always select the same galaxies as
a colour split, so it is interesting to see how the results compare.
The second sections of Tables 2, 3 and 4 give details of the datasets
and results of the model fits to cells with galaxies classified by
$\eta$. The joint distributions for 20 Mpc cells are shown in Fig. 8.

Comparing the results for 20 Mpc cells, the results for colour
and $\eta$ are generally similar. The likelihood ratios again favour
the bivariate lognormal model over our two deterministic biasing
models, and suggest a slightly smaller difference between the
stochastic and deterministic models than found in the colour dataset.
This is verified by the smaller stochasticity found in the best
fitting bivariate lognormal model at all scales. Unlike for the colour
datasets, power law bias is only marginally inconsistent with the data
on the largest scales studied here. This difference between colour and
$\eta$ type results may reflect a physical difference in the relative
biasing relations, but firm conclusions are not yet possible.

Figure 8. Same as Fig. 7, with galaxies classified by $\eta$ type.



6.2.1 The goodness-of-fit statistics 

Table 3 shows the log-likelihood differences ($-\ln \Lambda$) of the
parameter fits, taking the bivariate lognormal model as our null
hypothesis. In all cases the linear model provides a worse fit to the
data than the power law model, and the power law a worse fit than the
bivariate lognormal model. This latter statement may however be
due to the addition of an extra free parameter to the model (Liddle
2004). To establish the significance of the difference between the
power law and bivariate lognormal model we can make use of the
theory given in Section 5.3. For example, for L = 20 Mpc cells
and assuming the Bayesian Information Criterion (equation 46),
we would require a likelihood ratio in excess of exp(6.5) to claim
that the bivariate lognormal model provides a significantly better
fit than the power law model with $r_{\rm LN} = 1$. However, as stated
earlier, Monte Carlo simulations of the power law model show that
a less stringent likelihood ratio in excess of $\sim \exp(1)$ is all
that is required. We measure a likelihood ratio of exp(38) for the
dataset with galaxies classified by colour, and exp(18) for $\eta$
classification, a highly significant result in both cases.

Although the likelihood ratios favour the bivariate lognormal
distribution over the two deterministic models, they do not tell
us how well the best-fitting distribution matches the data. For this
we turn to the KS statistic described in Section 5.3.2. On scales
L > 15 Mpc our KS statistic accepts the model with a probability
greater than 0.5. On the smallest scales studied here we find
this probability to decrease, in line with the trend for the
univariate lognormal distribution (Section 6.1).

Table 3. The best fitting deterministic biasing model parameters for each dataset. The level of nonlinearity of the model is given by $\tilde{b}/\hat{b}$,
which is unity by definition for the linear model. The penultimate column shows the log-likelihood differences between the best fit linear or
power law models and bivariate lognormal model. A positive value indicates the bivariate lognormal model is a better fit to the data. The final
column shows how many cells must be removed to reduce the power law likelihood ratio to $\sim \exp(1)$ (Section 6.3).

Table 4. The best-fitting bivariate lognormal model parameters for each dataset. The errors shown are derived from Gaussian fits to the parameter
likelihood surface. $\Delta(r_{\rm LN})$ is derived from propagation of $\Delta[(1 - r_{\rm LN}^2)^{1/2}]$. The remaining columns give the average biasing parameters.
Appendix A gives the analytic solutions for each parameter in the case of the bivariate lognormal model. The final two parameters measure
the nonlinearity and stochasticity of the model (equations 13, 14).

Figure 9. The bivariate counts-in-cells distributions for 15 (top), 25 and 35 Mpc length cells. The left hand panel shows the data with early and late galaxies
classified by colour. Larger points indicate the cells identified as outliers from $r_{\rm LN} = 1$ (see Section 6.3). The central column shows a Monte Carlo simulation
of the best fitting power law model and the right hand column the best fitting bivariate lognormal model. The dashed line indicates a mean biasing of b = 1.0,
the dot-dash line shows the best fit mean bias.


6.2.2 Stochasticity and nonlinearity 

In order to quantify the nonlinearity and stochasticity of the joint
distribution of early and late type galaxies we assume that the
bivariate lognormal model is an accurate representation of the data.
In our analysis the log-density correlation coefficient $r_{\rm LN}$
provides a complete measure of the stochasticity; to aid comparison
with other work we compute the mean biasing, its nonlinearity and the
average biasing scatter of equations (8) to (12). For clarity we
concentrate briefly on the results for 20 Mpc cells, shown in the
third line of Table 4. These indicate that whilst the nonlinearity
[equation (13)] is only 1.03, the stochasticity [equation (14)] is
0.31. This high stochasticity is reflected in the deviation of
$r_{\rm LN}$ from one, and the low linear correlation coefficient of
$r_{\rm lin} = 0.93$. It is important to note that these statistics
account for Poisson noise, as the models were convolved with a Poisson
distribution before being fitted to the data.

Table 3 shows for comparison some biasing statistics for our 
two deterministic models. It can be seen that a similar nonlinearity 
is measured by the power law model, whilst the linear correlation 
coefficient remains close to one, reflecting the inability of the 
model to measure stochasticity. The best fitting linear bias model 
has a mean biasing parameter closer to one, indicating that by 
assuming this model previous studies may have underestimated the 
magnitude of relative biasing. 
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Figure 10. The scale dependence of nonlinearity and stochasticity in the 2dFGRS. The solid line shows results for galaxies classified by colour and dotted 
line for galaxies classified by $\eta$ type. The circles show the two semianalytic datasets. Mock 1 is described in the text as the "superwind" model, and mock 2 as 
the "low-baryon" model. The open diamonds indicate values measured for the colour dataset using direct variance estimates, where they differ by more than 1 
sigma from the results derived from model fitting (see Section 6.5). For clarity, errors are omitted for the $\eta$ and second mock datasets. 

It may be considered surprising that the correlation parameter 
$r_{\rm LN}$ can be measured so precisely that $r_{\rm LN} = 0.97$ can be clearly 
distinguished from $r_{\rm LN} = 1$. The reason for this can be seen by 
examining the expression for the bivariate lognormal distribution 
[equation (33)], in which the scatter in $\delta_L$ at fixed $\delta_E$ is 
proportional to $S = \sqrt{1 - r_{\rm LN}^2}$. This is a more meaningful quantity than 
the correlation coefficient, but it lacks a standard name. In this 
context, the obvious term for $S$ would be 'stochasticity', but this is 
already taken and we resist the temptation to expand the terminology 
further. The stretched nature of this measure of correlation is 
quite extreme (as noted independently by Seljak & Warren 2004): 
$S = 0.5$ corresponds to $r_{\rm LN} = 0.87$. Therefore, $r_{\rm LN} = 0.87$ is 
effectively half-way to no correlation at all. This is why even a 
correlation as high as $r_{\rm LN} = 0.97$ is noticeably imperfect in terms of 
density-density plots. 
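
The stretched mapping between the correlation coefficient and the scatter measure $S$ can be tabulated directly. The short Python sketch below is only an illustration of the relation $S = \sqrt{1 - r_{\rm LN}^2}$ quoted in the text; the function name `log_scatter` is ours and does not come from the analysis code.

```python
import math

def log_scatter(r_ln: float) -> float:
    """Return S = sqrt(1 - r_LN^2), the scatter measure discussed in the text."""
    if not -1.0 <= r_ln <= 1.0:
        raise ValueError("correlation coefficient must lie in [-1, 1]")
    return math.sqrt(1.0 - r_ln * r_ln)

# Tabulate S for a few correlation coefficients quoted in this paper.
for r in (1.0, 0.99, 0.97, 0.958, 0.87):
    print(f"r_LN = {r:5.3f}  ->  S = {log_scatter(r):.3f}")
```

The table makes the point of the paragraph explicit: $r_{\rm LN} = 0.97$ already corresponds to $S \approx 0.24$, roughly half of the value reached at $r_{\rm LN} = 0.87$.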

6.2.3 Scale dependence 

We now look at how our results depend on scale. This is interest- 
ing because it may potentially distinguish whether the efficiency of 
galaxy formation in a particular region of space is affected by local 
or non-local factors. Examples of local factors could be density, ge- 
ometry, or velocity dispersion of the dark matter. Non-local factors 
could involve, for example, effects of ionising radiation from the 
first stars or QSOs on galaxy formation efficiency, causing coher- 
ent variation over larger scales than possible from local factors. 
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Table 5. Colour and $\eta$ samples with cell length L = 20 Mpc are each split into two 
redshift groups at z = 0.09. This Table shows results for the bivariate lognormal 
model fit to each subsample. 
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Figure 11. The scale dependence of the ratio of variances, $b_{\rm var}$, and the linear correlation coefficient, $r_{\rm lin}$. Symbols as in Fig. 10. 



Fig. 9 shows the bivariate distributions for 15, 25 and 35 Mpc 
cells with galaxies split by colour. The likelihood ratio tests 
(Table 3) show that the bivariate lognormal bias model provides a 
significantly better fit to our data than both deterministic models on all 
scales for colour selection and all but the largest scales when 
classifying by $\eta$. 

The nonlinearity and stochasticity as a function of scale are 
plotted in Fig. 10, with errors derived by propagation from those 
shown in Table 4. The commonly quoted parameters $b_{\rm var}$ and $r_{\rm lin}$, 
which combine both nonlinearity and stochasticity, are plotted in 
Fig. 11 to ease comparison with results in the literature. On small 
scales ($\lesssim 20$ Mpc) the average biasing statistics suffer systematic 
errors from overestimates of the variance by the Poisson sampled 
lognormal model fit, as discussed in detail in Section 6.5. To 
indicate the magnitude of these effects, open diamonds show results 
for the colour dataset using direct variance estimates, where the 
difference is greater than 1 sigma. It can be seen that although both 
nonlinearity and $b_{\rm var}$ show noticeable change with scale, this can 
be mostly explained by the poor fit of the model. There is little 
effect on stochasticity, and both mocks and data are affected in the 
same way, making comparison practical. The nonlinearity reaches 
< 1% by around 35 Mpc with results for colour and $\eta$ classification 
barely distinguishable. A little care is needed in interpreting 
this result, however: negligible 'nonlinearity' does not mean that 
linear bias is a good fit. As much as anything, this is a statement 
that the amplitude of fluctuations declines for large L, so most cells 
have $|\delta| < 1$. 

The stochasticity also declines, although on large scales the 
errors prevent distinction between a flat or declining function with 
scale. There is a tendency for the stochasticity of the $\eta$ datasets to 
lie a little below that of the colour datasets, but this is not significant 
within the errors. The dashed and dash-dot lines show the results for 
two semianalytic mock universes which will be discussed in detail 
in Section 7.2. We can immediately note the encouraging general 
agreement: stochasticity is clearly expected at about the detected 
level. 



6.2.4 Division by luminosity and redshift 

By splitting galaxies by their luminosity we can investigate whether 
the effects that we find in the previous Sections could be due to the 
luminosity difference of the galaxy types. By dividing galaxies at 
$M - 5\log_{10}(h) = -19.5$, we form two similar sized groups with 
class 1 being more luminous than class 2. The models are fitted 
as before by replacing E (L) with class 1 (2). In contrast to the 
outcome when galaxies are divided by type, the likelihood ratios 
between the best fitting power law and bivariate lognormal models 
are small on all scales, ranging from to 3. We find $r_{\rm LN}$ to be 
roughly constant with a value of $\sim 0.99$ for the best fitting bivariate 
lognormal models. If stochasticity is caused by some variable 
other than the local density during galaxy formation, then perhaps 
luminosity is less dependent on this variable than galaxy type and 
colour. Other explanations could be that our volume limited sam- 
ple is too shallow to find the expected bimodality in luminosity, 
and the position of our boundary between bright and faint galaxies 
is arbitrary. 

It is also of interest to see if the results are independent of 
redshift. We divide the survey at $z = 0.09$ and fit the models to both 
high and low redshift galaxies using a cell length of $L = 20$ Mpc. 
Table 5 shows the best fitting bivariate lognormal parameters for 
galaxies split by colour and $\eta$ for both redshift groups. Due to the 
fibre apertures of the 2dF instrument we may expect some redshift 
dependence for galaxies classified by $\eta$ (see Section 3.2.1), yet 
precise predictions are difficult. We certainly see no difference within 
the errors between these two redshift groups, and the difference 
for colour classification cannot be attributed to such effects. It is 
possible that the changing errors on the colour at high redshift 
contribute to the decrease in stochasticity, although evolution cannot 
be ruled out. There is certainly room for further investigation with 
forthcoming larger redshift surveys. 
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Figure 12. Wedge plots of the 2dFGRS volume limited survey SGP region as for Fig.3. Both early and late type galaxies are shown. Overplotted are 
cells identified as causing the stochasticity signal, from top left: 10,15,20,25 Mpc cells. 



6.2.5 Comparison with other 2dFGRS results 

This work has been carried out in conjunction with that of Conway 
et al. (2004), who investigate the variance and deviation from 
linear bias in the 2dFGRS NGP and SGP regions using flux-limited 
samples, including a counts-in-cells analysis. They find similar 
discrepancies between the Poisson sampled lognormal model and the 
data, investigating the causes and magnitude of the problem in 
detail. After accounting for this effect in both analyses the results 
agree within the 1 sigma errors where comparable: Conway et al. 
find $1/b_{\rm var} = 1.25 \pm 0.05$, and nonlinearity ($\tilde{b}/\hat{b}$) of a few 
per cent on the smallest scales measured. Our results for $b_{\rm pow}$ agree, 
but are consistently higher for $b_{\rm lin}$. This is due to different fitting 
procedures; Conway et al. give greater weight to overdense regions. 

Madgwick et al. (2003) measure the square root of the ratio 
of the correlation functions of early and late type galaxies to be 
around 1.2 on their largest scales of $8 < r < 20\,h^{-1}$ Mpc. Their 
bias parameter corresponds to $1/b_{\rm var}$ in the notation of Section 2.1 
(DL99). This gives a value for $b_{\rm var}$ a little higher than our results of 
Table 4, but entirely consistent when lognormal variance estimates 
are replaced by direct measures as in Section 6.2.3. 



6.3 Origin of the stochasticity signal 

Before the detection of stochastic bias is accepted, and we proceed 
to confront the result with theoretical models, a degree of skepti- 
cism is in order. We have seen that some regions of space have a 
number ratio of early and late type galaxies that differs from the 
typical value by too much to be consistent with Poisson scatter. 
Such an outcome seems potentially vulnerable to systematics in 
the analysis as any source of error in classifying galaxies could in- 
troduce an extra scatter, spuriously generating the impression of 
stochasticity. However, it is not clear which way this effect would 
go. Suppose the survey finds galaxies with perfect efficiency, but 
then assigns them a random class. Any true initial stochasticity is 
erased by the classification 'errors' and we measure $r_{\rm LN} = 1$. In 
order to generate apparent stochasticity where none is present we 
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Figure 13. The colour distribution of cells identified as causing the stochasticity signal (L = 20 Mpc, filled line). The left (right) plot shows all those cells 
with an excess of red (blue) galaxies. The comparison plot (dashed line) is calculated from those cells with similar $\delta_E$ to the outlier cells, in order to account 
for nonlinearities. 



would need something more subtle. Possibilities could include a 
perfect efficiency in detecting early type galaxies, but a fluctuat- 
ing efficiency in finding late types; a spatially varying boundary 
between early and late types; or large variations in the survey se- 
lection function on scales smaller than the cell length. To assess the 
possible contribution of this latter effect to our measured stochastic- 
ity, we applied small scale incompleteness masks to semi-analytic 
datasets (see Section 7.2). The large scale stochasticity of these 
models was affected by less than 1 sigma. 

Whether or not a spurious generation of stochasticity seems 
plausible, it is worth looking more closely at the data to see how 
the signal arises. In order to do this, we focus on the outliers from 
the relation $\ln(1 + \delta_L) \propto \ln(1 + \delta_E)$, but a careful definition 
of an outlier is required. We want to ask how much the numbers 
$(N_E, N_L)$ differ from their expectation values when clustering is 
included, but the latter are unknown. Therefore, we take the 
best-fitting power law model with $r_{\rm LN} = 1$ and integrate over the 
distribution of densities to calculate the probability for obtaining this 
outcome, $P(N_E, N_L)$, accounting for Poisson noise. The most 
outlying points are those with the lowest values of $P$, and we remove 
these in succession until the remaining cells are consistent with an 
$r_{\rm LN} = 1$ model. The numbers of outliers in this sense are listed in 
Table 3, and Fig. 9 shows their positions on density-density plots. 
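
The probability $P(N_E, N_L)$ used to rank cells can be sketched numerically. The Python fragment below is a minimal illustration and not the code used for this paper. It assumes a lognormal early-type density with log-width `omega_e`, a power law relation $1 + \delta_L \propto (1 + \delta_E)^{b_1}$ normalised so that both overdensities average to zero, and a single expected count per type for every cell (the real survey has $\langle N \rangle$ varying with completeness). The function names, the parameter names and the simple grid integration are all our own choices, and the successive removal and refitting step is not implemented.

```python
import math
import numpy as np

def log_poisson(n, lam):
    """Log Poisson probability of the integer count n for mean(s) lam."""
    return n * np.log(lam) - lam - math.lgamma(n + 1)

def cell_probability(n_e, n_l, nbar_e, nbar_l, omega_e, b1, n_grid=801):
    """P(N_E, N_L) for deterministic power law bias with Poisson sampling."""
    # Uniform grid in the Gaussian variable g, out to 6 standard deviations.
    g = np.linspace(-6.0 * omega_e, 6.0 * omega_e, n_grid)
    dg = g[1] - g[0]
    gauss = np.exp(-0.5 * (g / omega_e) ** 2) / (omega_e * math.sqrt(2.0 * math.pi))
    # Lognormal early-type density with unit mean.
    one_plus_de = np.exp(g - 0.5 * omega_e ** 2)
    # Power law of the early-type density, rescaled to unit mean.
    one_plus_dl = np.exp(b1 * g - 0.5 * (b1 * omega_e) ** 2)
    log_p = (log_poisson(n_e, nbar_e * one_plus_de)
             + log_poisson(n_l, nbar_l * one_plus_dl))
    # Integrate the Poisson probabilities over the density distribution.
    return float(np.sum(gauss * np.exp(log_p)) * dg)

def rank_outliers(counts_e, counts_l, nbar_e, nbar_l, omega_e, b1):
    """Cell indices sorted from least to most probable, plus the probabilities."""
    p = np.array([cell_probability(ne, nl, nbar_e, nbar_l, omega_e, b1)
                  for ne, nl in zip(counts_e, counts_l)])
    return np.argsort(p), p
```

In this picture the first entries returned by `rank_outliers` are the candidate outliers; in the procedure described above they would be removed one at a time until the remaining cells are consistent with an $r_{\rm LN} = 1$ model.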

Having identified the cells that provide the evidence for 
stochasticity, we can examine their properties in more detail. Fig. 
12 shows the spatial distribution of the outlying cells within the 
2dFGRS for a range of cell sizes, from which it can be seen that 
they are often associated with overdense regions. This should not 
be taken as indicating that stochasticity is confined to such regions: 
given that the degree of stochasticity is small, the cells that con- 
tain the most galaxies will provide the best signal-to-noise for the 
effect. The colour distribution of galaxies in the outlying cells is 
shown in Fig. 13, compared to the distribution of 'normal' cells. To 
allow for nonlinearity in the density-density relation we consider 
for the comparison distribution only those cells with similar 
values of $\delta_E$. The distributions cover sensible ranges of colours, and 
the peaks corresponding to early and late types appear to be in the 
correct places. What causes these cells to be outliers is that the ra- 
tio of the two populations differs greatly from what is typical, and 
it is hard to see how this result can be in error. The completeness 
values in these cells are typically 0.8, and yet we see variations in 
the early:late ratio by more than a factor 2. Moreover, similar vari- 
ations are seen whether we classify using colour or spectral type. 
We therefore conclude that these variations are a real property of 
the galaxy distribution. 

6.4 Consistency checks 

We repeat the analysis for cells with galaxies classified randomly, 
recovering a best-fitting bivariate lognormal model with $r_{\rm LN} = 1$ 
exactly. By fitting the bivariate lognormal model to Monte Carlo 
simulated power law mocks (Section 5.3), we can check for any 
bias inherent in our fitting procedures. The best fit models have 
mean $r_{\rm LN} > 0.998$, not significantly different from the $r_{\rm LN} = 1$ 
of power law deterministic bias. 
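
The random-classification check can be illustrated with a toy Monte Carlo. The sketch below is an assumption-laden stand-in for the full likelihood fit: it draws Poisson counts from a lognormal density field, assigns each galaxy a class at random, and then estimates a shot-noise corrected linear correlation coefficient from the moments of the counts rather than fitting the bivariate lognormal model. The function names and the parameter values in the usage line are hypothetical.

```python
import numpy as np

def random_class_mock(n_cells, nbar, omega, frac_early, seed=0):
    """Poisson-sample a lognormal field, then classify galaxies at random."""
    rng = np.random.default_rng(seed)
    g = rng.normal(0.0, omega, n_cells)
    density = np.exp(g - 0.5 * omega ** 2)        # 1 + delta, unit mean
    n_tot = rng.poisson(nbar * density)
    n_early = rng.binomial(n_tot, frac_early)     # random class assignment
    return n_early, n_tot - n_early

def shot_noise_corrected_r(n_e, n_l):
    """Linear correlation of the underlying densities, Poisson noise removed."""
    n_e = np.asarray(n_e, dtype=float)
    n_l = np.asarray(n_l, dtype=float)
    cov = np.mean((n_e - n_e.mean()) * (n_l - n_l.mean()))
    var_e = n_e.var() - n_e.mean()                # subtract Poisson variance
    var_l = n_l.var() - n_l.mean()
    return cov / np.sqrt(var_e * var_l)

r = shot_noise_corrected_r(*random_class_mock(200_000, 10.0, 0.5, 0.4))
print(f"recovered correlation: {r:.3f}")
```

Because both classes sample the same underlying density, the recovered coefficient should scatter around one, which is the behaviour reported for the randomly classified cells.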

6.5 Direct variance estimates 

It is possible to determine the variance $\sigma^2(V)$ directly without 
assuming the lognormal model. Optimal power spectrum 
estimates perhaps provide the most accurate determinations of 
variance (Tegmark et al. 2004; Pen et al. 2003); however, for our 
purposes it suffices to use a simpler method presented by Efstathiou 
et al. (1990). Their estimator calculates $\Delta N = N - \langle N \rangle$ for each 
cell and subtracts the Poisson variance from $(\Delta N)^2$ to form an 
estimate of $\sigma^2$ for each cell. This is then averaged over all cells. The 
estimator only applies in the case of a uniform survey, where $\langle N \rangle$ 
is the same for all cells. For the general case of an incomplete 
survey, Efstathiou et al. derive a slightly different estimator, assuming 
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a Gaussian density field. In fact, this is a poor assumption even on 
the largest scales considered here. We use Monte Carlo realisations 
of lognormal fields to show that their estimator for $\sigma$ is biased low 
by around 1-2%, and has an uncertainty often several times larger 
than that expected for the Gaussian model. Table 6 gives the direct 
variance estimates for our data, with errors from both Efstathiou 
et al. (1990) and Monte Carlo simulations. 
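
For the uniform-survey case the direct estimator described above is simple enough to sketch. The Python fragment below follows the verbal description only: subtract the Poisson variance from $(\Delta N)^2$ in each cell and average. Dividing by $\langle N \rangle^2$ to express the result as a variance of the overdensity is our assumption, as are the function names; the modified estimator for an incomplete survey is not reproduced here.

```python
import numpy as np

def direct_variance(counts):
    """Shot-noise corrected variance of the overdensity for a uniform survey."""
    counts = np.asarray(counts, dtype=float)
    nbar = counts.mean()
    # Per-cell estimate: (Delta N)^2 minus the Poisson variance, in units of nbar^2.
    per_cell = ((counts - nbar) ** 2 - nbar) / nbar ** 2
    return float(per_cell.mean())

def lognormal_poisson_counts(n_cells, nbar, omega, seed=0):
    """Toy mock: Poisson counts drawn from a lognormal density field."""
    rng = np.random.default_rng(seed)
    g = rng.normal(0.0, omega, n_cells)
    return rng.poisson(nbar * np.exp(g - 0.5 * omega ** 2))

counts = lognormal_poisson_counts(200_000, 10.0, 0.5)
print(direct_variance(counts), np.expm1(0.5 ** 2))   # estimate vs. input variance
```

With a large number of toy cells the estimate should approach the input value $\exp[\omega^2] - 1$; the bias and enlarged errors discussed in the text arise in the more realistic regime of few cells and non-uniform sampling.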

Even accounting for the bias, these direct variance estimates 
remain generally 10-20% lower than those estimated by fitting 
a Poisson sampled lognormal curve. For early type galaxies in 
$L = 10$ Mpc cells the discrepancy is nearly 40%. Imposing 
different weighting schemes during fitting can lower the lognormal 
variance to meet the direct variance results (Conway et al. 2004), 
but these have no significant effect on our measurement of stochas- 
ticity. 

This failure of the lognormal model to recover the true vari- 
ance of the data may be due to the assumption of the Poisson Clus- 
tering Hypothesis which we know to be incorrect in detail. On the 
smallest scales our cells are largely shot noise dominated, and it is 
on these scales that the discrepancy is greatest (see also Section 
6.1.1). It remains important to emphasise that the variance esti- 
mates given throughout this paper are model dependent, and not to 
be taken as the true variance of the galaxies in the survey, which can 
be estimated more accurately through model independent methods. 

An unfortunate side effect of this difficulty in obtaining 
accurate estimates for variance is that the average biasing 
statistics of Section 2.1 are dependent on $\sigma$ (see also Section 6.2.3). 
Tests show this to have little effect on stochasticity ($\sigma_b/\hat{b}$), but 
on scales $\lesssim 20$ Mpc nonlinearity is overestimated. By replacing 
our measured variance with results obtained from bias corrected 
direct estimation we find nonlinearity to decrease to around 2% for 
$L = 10$ Mpc, decreasing with scale gradually to match our 
measured values by $L = 30$ Mpc. Stochasticity is decreased by about 2 
sigma at $L = 10$ Mpc to around 0.39, but the effect is insignificant 
on all other scales. These results may be compared with the 
measurements of variance and deterministic bias in the 2dFGRS using 
flux-limited samples over a slightly larger volume (Conway et al. 
2004). 

This is a suitable point to discuss a subtlety of cell counts that 
we have neglected so far. Variation in the survey mask is 
represented by $\langle N \rangle$ varying between cells. We have treated this as a 
simple variation in sampling efficiency that is uniform over the cell. 
However, this cannot be precisely correct: where sampling of a cell 
is low because it encounters one of the larger drills in the input 
catalogue, it would be more correct to assume a completely sampled 
cell of smaller volume. We have explored this alternative extreme 
by assuming that $\sigma \propto \langle N \rangle^{-0.3}$, as expected for a $\xi(r) \propto r^{-1.8}$ 
spectrum. Since $\langle N \rangle \propto$ completeness, and the typical cell 
completeness is about 0.8, the measured values of $\sigma$ are increased by 
about 7 per cent, approximately a 1-$\sigma$ shift. This has no effect on 
our detection of stochasticity, and because the 'lost volume' 
assumption will not apply in all cases, we neglect the issue. Note that 
estimates of cell variances derived from integration over 
correlation functions or power spectra would be completely independent 
of this issue. 
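
The 7 per cent figure can be checked with a short scaling argument. The lines below are a sketch under the simplifying assumption that the cell variance follows the same power law as the correlation function evaluated at the cell size $L$, with $V \propto L^3$ the effective cell volume and $\langle N \rangle \propto V$ at fixed galaxy density.

```latex
\sigma^2(L) \propto \xi(L) \propto L^{-1.8}
\;\Longrightarrow\;
\sigma \propto L^{-0.9} \propto V^{-0.3} \propto \langle N \rangle^{-0.3},
\qquad
\frac{\sigma(0.8\,V)}{\sigma(V)} = 0.8^{-0.3} \approx 1.07 .
```

A cell whose effective volume is reduced to 0.8 of its nominal value would therefore have $\sigma$ larger by about 7 per cent, consistent with the shift quoted above.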



7 COMPARISON WITH SIMULATIONS 

In order to interpret our measurements of stochastic bias, we need 
to make a comparison with theory. In practice, this means consider- 
ing the results of numerical simulations that are sufficiently detailed 



to predict the spatial distributions of the different classes of galaxy. 
There are currently two main methods of simulating the large scale 
structure of the visible universe: semianalytic or hydrodynamic. We 
consider each in turn. 

7.1 Previous work 

Somerville et al. (2001) used semianalytic models to measure the 
relative bias between early and late type galaxies (as defined by 
bulge to total luminosity) and red and blue galaxies on scales of 
$r = 8\,h^{-1}$ Mpc. They set a limiting magnitude of 
$M_B - 5\log h \le -18.4$, and split galaxies by colour at $B - R = 0.8$, 
making their samples reasonably comparable to our data for cells of 
$L = 20$ Mpc. Our value of $b_{\rm var} = 0.66$ (Table 4) falls in 
between their values of 0.77 for late/early types and 0.55 for blue/red 
galaxies. They find $r_{\rm lin} = 0.87$ for both subgroups, slightly lower 
than our values for both colour and spectral type. Unfortunately the 
results are not split into stochasticity and nonlinearity, making it 
difficult to make further comparisons. It is however interesting that 
we find a lower amplitude of relative bias between the two colour 
groups than is seen in these models. 

The hydrodynamic simulations of Yoshikawa et al. (2001) 
classify galaxies by their formation redshift, and are smoothed with 
top hat spheres of radius $8\,h^{-1}$ Mpc. By using this classification 
scheme, hydrodynamic models approximate early type galaxies as 
those that form at high redshifts via initial starbursts, whereas late 
type galaxies have a lower formation redshift and undergo slower 
star formation. They find that old galaxies are positively biased with 
respect to matter with a linear correlation coefficient of less than 
1, whereas young galaxies are slightly antibiased with a 
correlation coefficient closer to 1. They measure the relative bias between 
galaxy types by $(\xi_{\rm young}/\xi_{\rm old})^{1/2}$, where $\xi_{\rm young}$ ($\xi_{\rm old}$) is the 
two-point correlation function of the young (old) galaxies. This is 
equivalent to $b_{\rm lin}$. They obtain values of between 0.5 and 0.66 for 
scales of $1\,h^{-1}$ Mpc $< r < 20\,h^{-1}$ Mpc, lower than our 
equivalent values for the linear biasing model with $L < 25$ Mpc cells 
(Table 3). Once again results for stochasticity and nonlinearity are 
not quoted for the relative bias. 

7.2 Preliminary mock comparison 

None of this past work really allows a direct comparison with 
our results, so we generated two new theoretical 'datasets' from 
the results of large semianalytic calculations carried out using the 
'Cosmology machine' supercomputer at Durham. The background 
model is that deduced from the simplest WMAP+2dFGRS analysis 
of Spergel et al. (2003): flat, $\Omega_m = 0.27$, $\Omega_b = 0.045$, $h = 0.72$, 
$n = 0.97$, $\sigma_8 = 0.8$. We apply the semianalytic apparatus of Cole 
et al. (2000) to a simulation with $N = 500^3$ particles in a box 
of side $250\,h^{-1}$ Mpc. As shown by e.g. Benson et al. (2003), a 
problem faced by such modelling is a tendency to over-produce 
massive galaxies, as a result of excessive cooling arising from the 
higher baryon density now required by CMB+LSS. This problem 
is particularly severe for disk (late type) galaxies. The first mock 
adopted the 'superwind' approach of Benson et al. (2003) in an 
attempt to alleviate this problem, but the cure is not total. The sec- 
ond mock attempted to reduce cooling by retaining the low baryon 
density of Cole et al. (2000). Although this conflicts with CMB 
data, it provides a useful means of comparison. For this applica- 
tion, we took an empirical approach in which a monotonic shift in 
luminosity was applied to force the models to have the observed lu- 
minosity function as in Madgwick et al. (2002). The model colour 
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Table 6. Different variance estimates and errors for early and late datasets defined by colour, (a) From our bivariate lognormal model fit with 
errors derived from a multidimensional Gaussian fit to the likelihood surface; (b) Efstathiou et al. (1990) direct variance estimator and errors; 
(c) direct variance estimator after using Monte Carlo simulations of lognormal fields to correct for bias due to non-Gaussianity, with rms 
errors from the simulations. 
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distribution was bimodal to a realistic degree, so this shift was ap- 
plied separately to generate model distributions of early and late 
type galaxies in which the global luminosity functions were cor- 
rect. The resultant mock cell counts were analysed identically to 
the real data. 
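
One simple way to realise a monotonic luminosity shift of this kind is rank-preserving quantile mapping. The sketch below is an assumption about how such a shift could be implemented, not a description of the procedure actually used for the mocks: the target luminosity function is represented by a sample of observed magnitudes, and the function name `match_luminosity_function` is hypothetical.

```python
import numpy as np

def match_luminosity_function(model_mag, target_mag):
    """Remap model magnitudes onto the target distribution, preserving rank order."""
    model_mag = np.asarray(model_mag, dtype=float)
    target_mag = np.asarray(target_mag, dtype=float)
    # Rank of each model galaxy (0 = most negative magnitude, i.e. brightest).
    ranks = np.argsort(np.argsort(model_mag))
    # Convert ranks to quantiles and read off the target magnitude at each one.
    quantiles = (ranks + 0.5) / model_mag.size
    return np.quantile(target_mag, quantiles)

# Toy usage with made-up magnitude distributions.
rng = np.random.default_rng(0)
model = rng.normal(-19.0, 1.0, 500)
observed = rng.normal(-19.5, 0.7, 2000)
shifted = match_luminosity_function(model, observed)
```

Applying the same function separately to the model early and late type galaxies, each against its own observed sample, would mimic the separate shifts described in the text. Matching absolute number densities would additionally require the survey and simulation volumes, which this sketch ignores.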

In some respects, these simulations match the real data very 
well. For the low-baryon model, the amplitude of the cell variances 
for early-type galaxies agree to within 3% on small scales and 10% 
on large scales. The superwind model variances agree to within 
10% on all scales. The relative bias of the low-baryon model agrees 
to within 10 and 15% with observation, and the superwind model 
to within 10 and 20%. Significant stochasticity and nonlinearity is 
also required, which can be measured accurately as a function of 
scale because we are able to use more mock cells than are available 
in the real data. The mocks are affected in a similar manner to the 
data by the discrepancy between direct estimates of variance and 
those from lognormal model fits. On small scales this significantly 
increases our estimated nonlinearity; as the effect is equivalent be- 
tween mocks and data, however, a direct comparison between them 
remains instructive. Fig. 10 shows the resulting stochasticity and 
nonlinearity as a function of scale, compared to that of the 2dF- 
GRS data. The impression is that the mock results show a greater 
nonlinearity than the real data on small scales, while stochasticity 
is well matched within the errors. 

Given the known imperfect nature of the semianalytic simu- 
lations (e.g. the failure to match luminosity functions exactly), the 
correct attitude is probably to be encouraged by the degree of agree- 
ment with the data. It is certainly plausible that the existing calcu- 
lations contain all the relevant physical contributions to bias, but 
perhaps not yet in quite the right proportions. As usual with such 
numerical comparisons, this raises the question of whether the issue 
of stochasticity can be understood in a more direct fashion. In the 
end, the effects we are seeing must be reducible to the way in which 
the early:late ratio varies between and within virialized systems of 
different mass, so that in effect we are dealing with a more general 
version of the morphological segregation that is familiar from the 
study of rich clusters (Narayanan, Berlind & Weinberg 2000). We 
intend to pursue this in more detail elsewhere, using the catalogue 
of galaxy groups derived from the 2dFGRS by Eke et al. (2004). 



8 SUMMARY AND CONCLUSIONS 

We have presented fits of three relative biasing schemes to joint 
counts-in-cells distributions of 2dFGRS galaxies, separated by both 
colour and spectral type $\eta$. Each scheme is convolved with a 
Poisson distribution to account for statistical 'shot noise'. Our first two 
models present two alternative types of deterministic biasing: lin- 
ear and power law bias. Linear bias is an important concept in cos- 
mology and many results are linked to it, but it is not physically 
plausible as it allows negative densities. Power law bias presents a 
simple cure for this problem, but still has little physical motivation. 
With the advent of large semianalytic and hydrodynamic simula- 
tions, interest has grown in 'stochastic' bias models. Bias could be 
determined by parameters other than the local overdensity of the 
dark matter, and considerable scatter could occur in the relation. 
Galaxy distributions have previously been measured to be well ap- 
proximated as lognormal, therefore a bivariate lognormal distribu- 
tion seems a natural model for relative bias between galaxy types. 
This model incorporates stochasticity and nonlinearity in a well de- 
fined manner, which is mathematically simple and consistent with 
observation. 

To account for the discrete nature of galaxies, the Poisson sam- 
pling hypothesis is assumed, and all models are convolved with a 
Poisson distribution. On small scales where our cell counts become 
shot noise dominated, we find this hypothesis to fail, causing 
overestimates of variance compared to direct estimation methods. The 
main symptom of the discrepancy is a number of completely empty 
cells that exceeds the Poisson sampled lognormal prediction. This 
is found not to affect our results for stochasticity, and the same 
effect is seen in the simulations, but it emphasises the need for 
a greater understanding of Poisson statistics in relation to galaxy 
clustering. 

We have detected a significant deviation from $r_{\rm LN} = 1$ in the 
2dFGRS and confirmed this detection of stochasticity through 
likelihood ratio tests, Kolmogorov-Smirnov probability testing, and 
Monte Carlo simulations. We have measured stochasticity at a level 
of $\sigma_b/\hat{b} = 0.44 \pm 0.02$ or $r_{\rm LN} = 0.958 \pm 0.004$ on the smallest 
scales (10 Mpc), declining with increasing cell size. The 
nonlinearity of the biasing relation is less than 5% on all scales. The small 
measured values of stochasticity and nonlinearity support the use 
of galaxy redshift surveys for studies of the large scale distribution 
of matter in the universe, and the measurement of cosmological 
parameters. However, as precision in cosmology increases and new 
techniques are developed, the effects of stochastic bias on 
parameter estimation should be understood. For example, studies of 
cosmology through weak gravitational lensing require knowledge of 
nonlinear and stochastic bias (Seljak & Warren 2004). Our results 
for $r_{\rm lin}$ on 10 Mpc scales are consistent within the (large) errors 
with galaxy-mass correlations measured by weak lensing surveys 
(Hoekstra et al. 2002) on the largest scales probed. 

A comparison with semianalytic simulations shows a similar 
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variation of nonlinearity and stochasticity with scale. The ampli- 
tude of stochasticity appears a little lower than in the true data, 
particularly on large scales, and the nonlinearity is slightly greater 
on small scales. Nevertheless, given the known imperfections of the 
current generation of semianalytic calculations, the general agree- 
ment is certainly encouraging. We hope that this work will stimu- 
late the investigation of more detailed biasing models. Through the 
linking of new simulations to observations, a more thorough under- 
standing of the processes of galaxy formation and evolution should 
be within our reach. 
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The conditional probability distribution for the bivariate lognormal 
model is given by equation (33) 
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where $g_i = \ln(1 + \delta_i) - \langle \ln(1 + \delta_i) \rangle$, $\langle \ln(1 + \delta_i) \rangle = -\omega_i^2/2$ 
and $\tilde{g}_i = g_i/\omega_i$. $\tilde{g}_2|\tilde{g}_1$ follows a univariate Gaussian distribution 
with mean $r_{\rm LN}\tilde{g}_1$ and variance $1 - r_{\rm LN}^2$. The covariance matrix 
V and correlation coefficient $r_{\rm LN}$ are both defined in log space, 
and are given explicitly in equations (31) and (32). The variance of 
the distribution in linear space is related to the variance of the 
Gaussian field by 

$\sigma_i^2 = \langle \delta_i^2 \rangle = \exp[\omega_i^2] - 1$. (A3) 

On substituting equation (A2) into (A1) and integrating we find 
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From this basic parameter we can calculate the mean biasing and 
its nonlinearity [equations (8) and (9)] 



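$\hat{b} = \frac{\langle b(\delta_1)\delta_1^2 \rangle}{\sigma_1^2} = \frac{\exp[r_{\rm LN}\omega_2\omega_1] - 1}{\exp[\omega_1^2] - 1}$ 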
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We also know the ratio of variances, equation (15) 
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Although it is possible to derive the scatter for this model from 
equation (12), it can be shown that (DL99) 

$b_{\rm var}^2 = \tilde{b}^2 + \sigma_b^2$. (A8) 

Using this fact we obtain 

$\sigma_b^2 = \frac{\exp[\omega_2^2] - \exp[r_{\rm LN}^2\omega_2^2]}{\exp[\omega_1^2] - 1}$. (A9) 

Whilst the bivariate lognormal model contains constant scatter 
dependent only on $r_{\rm LN}$ in the log frame, transformation to the linear 
frame causes the scatter to become dependent on the widths of 
the univariate distributions and vary with $\delta_1$. 
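
The closed-form statistics of this appendix are easy to evaluate numerically. The Python sketch below collects them in one hypothetical helper, `lognormal_bias_stats`, taking the log-space widths $\omega_1$, $\omega_2$ and the correlation coefficient $r_{\rm LN}$ as inputs. The expression used for $\tilde{b}^2$ is the one implied by combining (A8) and (A9) with the ratio of variances obtained from (A3); the identification of the nonlinearity with $\tilde{b}/\hat{b}$, of the stochasticity with $\sigma_b/\hat{b}$, and of $r_{\rm lin}$ with $\hat{b}/b_{\rm var}$ follows the definitions of DL99 and should be read as our assumption about equations (13) to (15).

```python
import math

def lognormal_bias_stats(omega1, omega2, r_ln):
    """Average biasing statistics for a bivariate lognormal distribution."""
    sigma1_sq = math.expm1(omega1 ** 2)                     # eq. (A3), i = 1
    sigma2_sq = math.expm1(omega2 ** 2)                     # eq. (A3), i = 2
    b_hat = math.expm1(r_ln * omega1 * omega2) / sigma1_sq  # mean biasing
    b_var_sq = sigma2_sq / sigma1_sq                        # ratio of variances
    b_tilde_sq = math.expm1((r_ln * omega2) ** 2) / sigma1_sq
    sigma_b_sq = max(b_var_sq - b_tilde_sq, 0.0)            # eqs. (A8) and (A9)
    return {
        "b_hat": b_hat,
        "b_tilde": math.sqrt(b_tilde_sq),
        "b_var": math.sqrt(b_var_sq),
        "sigma_b": math.sqrt(sigma_b_sq),
        "nonlinearity": math.sqrt(b_tilde_sq) / b_hat,
        "stochasticity": math.sqrt(sigma_b_sq) / b_hat,
        "r_lin": b_hat / math.sqrt(b_var_sq),
    }

# Illustrative inputs only; these are not fitted values from Table 4.
print(lognormal_bias_stats(0.6, 0.5, 0.95))
```

For $r_{\rm LN} = 1$ and equal widths the helper returns $\hat{b} = 1$ and zero scatter, while any $r_{\rm LN} < 1$ produces a non-zero $\sigma_b$ even though the scatter in the log frame is constant.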



APPENDIX A: NON-LINEAR AND STOCHASTIC BIAS 
STATISTICS FOR THE BIVARIATE LOGNORMAL 
DISTRIBUTION 

The general biasing relation between the overdensities $\delta_1$ and $\delta_2$ of 
two subgroups of galaxies, or two types of matter, is fully described 
by equation (7) 
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